8 Computer Methods of Molecular Structure Elucidation from Unknown Mass Spectra
IN KI MUN and FRED W. McLAFFERTY
Downloaded by UNIV LAVAL on July 11, 2016 | http://pubs.acs.org Publication Date: November 6, 1981 | doi: 10.1021/bk-1981-0173.ch008
Cornell University, Department of Chemistry, Ithaca, NY 14853
Mass spectrometry (MS) is now a well-accepted tool for the identification as well as quantitation of unknown compounds. The combination of MS with powerful separation methods such as gas chromatography (GC) or high-performance liquid chromatography (LC) provides a technique which is widely accepted for the identification of unknown components in complex mixtures from a wide variety of problems such as environmental pollutants, biological fluids, insect pheromones, chemotaxonomy, and synthetic fuels. The importance of such analyses has grown exponentially in the last few years; there are now well over a thousand GC/MS instruments in use around the world, most with dedicated computer systems which make possible the collection of hundreds of unknown mass spectra per day from each instrument (1). One of the most serious limitations in the application of these powerful GC/MS and LC/MS systems is the accurate and efficient identification of this flood of unknown mass spectra. A variety of computer-assisted techniques have been proposed (2, 3, 4), which can be classified generally as "retrieval" or "interpretive" programs (2). The former match the unknown mass spectrum against a data base of reference spectra; the ultimate limitation of this approach is the size of the data base, which currently contains the mass spectra of 33,000 different compounds (5, 6), less than 1% of the number listed by Chemical Abstracts. If a satisfactorily matching reference spectrum cannot be found by the retrieval program, an interpretive algorithm can be used to obtain partial or complete structure information, or to aid the human interpreter in this task (7-10). This paper will focus on interpretive algorithms; the types of programs proposed and in use will be compared in terms of their functions and applications. Priorities for possible improvements to these algorithms will be proposed with particular reference to the most imperative needs which we perceive. Also of importance are the new opportunities arising from the rapid improvements in capacity, availability, and cost of powerful computer resources.
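The retrieval approach described above, matching an unknown spectrum against a data base of reference spectra, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: spectra are represented as dictionaries of m/z to intensity, cosine similarity stands in for the matching criterion (the cited retrieval programs use their own matching factors, which are not reproduced here), and the three library entries are invented toy data rather than real reference spectra.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(a) & set(b)
    dot = sum(a[mz] * b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(unknown, library, top_n=3):
    """Rank library entries by similarity to the unknown, best match first."""
    scored = [(cosine_similarity(unknown, spec), name)
              for name, spec in library.items()]
    # Sort by descending score; ties are broken alphabetically by name.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(name, score) for score, name in scored[:top_n]]


# Invented toy library (hypothetical entries, not measured spectra).
LIBRARY = {
    "compound A": {43: 100, 58: 40, 71: 10},
    "compound B": {77: 100, 105: 80, 51: 30},
    "compound C": {43: 90, 58: 50, 85: 5},
}
UNKNOWN = {43: 100, 58: 45, 71: 8}

RANKED = retrieve(UNKNOWN, LIBRARY)
for name, score in RANKED:
    print(f"{name}: {score:.3f}")
```

The sketch also makes the stated limitation concrete: if no entry resembling the unknown is in the library, every score is low and the ranking carries little structural meaning.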
0097-6156/ 81 /0173-0117$05.00/0 © 1981 American Chemical Society
Lykos and Shavitt; Supercomputers in Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1981.
Functions of an Interpretive System
Interpretation should generally follow the paradigm of heuristic search, "plan-generate-test" (11). Proposed algorithms differ in the emphasis placed on each of these steps; often one or more steps are left to the human interpreter. In the context of mass spectral interpretation the "planning" phase can be defined (12) as translation of the available information (unknown mass spectrum, etc.) into structural features such as molecular weight, elemental composition, and specific substructures. "Generation" involves constructing all possible molecules consistent with these data (13). "Testing" utilizes methods for ranking these postulates, such as by predicting their mass spectra and comparing these to the unknown (12-16). Thus, in "testing" one attempts to convert structural data into spectral data, the opposite of "planning", in which the unknown spectral data is used to postulate structural information.
Structural Information from Spectral Data. The kinds of information that can be derived from an unknown mass spectrum by either human or computer examination include the identities of substructural parts of the molecule (parts that both should, and should not, be present), data concerning the size of the molecule (molecular weight, elemental composition), and the reliability of each of these postulations. In our opinion, the latter is much more critical for mass-spectral interpretive algorithms than those for techniques such as NMR and IR; the effect of a particular substructure on the mass spectrum is often dependent on other parts of the molecule, and a thorough understanding of these effects can only be achieved by studying the spectra of closely related molecules. Several algorithms have been proposed which attempt to duplicate the human interpretive process (2, 3, 4, 17, 18), coding known fragmentation pathways so that the program can deduce structural features.
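The "plan-generate-test" paradigm outlined at the start of this section can be sketched as a small pipeline. This is only an illustrative skeleton under loud assumptions: the candidate pool, the candidate names, and their predicted peaks are invented; the planning step crudely takes the highest-mass peak as the molecular weight; and the test step scores a candidate by the fraction of its predicted peaks that appear in the unknown. None of these choices is taken from the programs cited in the text.

```python
# Hypothetical candidate pool: name -> (nominal mass, predicted peak m/z set).
CANDIDATES = {
    "candidate X": (58, {15, 43, 58}),
    "candidate Y": (58, {29, 57, 58}),
    "candidate Z": (72, {43, 57, 72}),
}


def plan(spectrum):
    """Plan: translate spectral data into structural constraints.

    Crude assumption: the highest-mass peak is the molecular ion.
    """
    return {"molecular_weight": max(spectrum)}


def generate(constraints):
    """Generate: list every candidate consistent with the constraints."""
    mw = constraints["molecular_weight"]
    return sorted(name for name, (mass, _) in CANDIDATES.items() if mass == mw)


def predict(candidate):
    """Predict the spectrum of a candidate (here, a stored toy peak set)."""
    return CANDIDATES[candidate][1]


def score(predicted, spectrum):
    """Test: fraction of predicted peaks that are observed in the unknown."""
    return len(predicted & set(spectrum)) / len(predicted)


def plan_generate_test(spectrum):
    """Run the three steps and return candidates ranked best first."""
    constraints = plan(spectrum)
    candidates = generate(constraints)
    return sorted(candidates,
                  key=lambda c: score(predict(c), spectrum),
                  reverse=True)


# Invented unknown spectrum as {m/z: intensity}.
UNKNOWN = {15: 20, 43: 100, 58: 35}
RESULT = plan_generate_test(UNKNOWN)
print(RESULT)
```

Note how the two directions of inference are kept apart: `plan` goes from spectral data to structural data, while `predict` and `score` go from structural data back to spectral data.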
However, complex programs are required to interpret the spectra of even simple molecules. Because synergistic effects of functional groups are ubiquitous in mass spectral decompositions, we see little hope that useful programs can be designed for other than spectra of narrowly-defined structural classes. For similar reasons present programs of this type should not be significantly helpful for total unknowns not in the current reference file (5, 6). However, almost any unknown will contain some substructures which are well-represented in the reference file. Because of this, useful substructural information can be obtained from retrieval program results, as the best-matching compounds often show structural features similar to those of the unknown. For example, SISCOM (which recognizes structural Similarity COding Multiple matching factors) provides such substructural information as well as a matching capability (4). "Pattern recognition" (19) is a
powerful technique for such correlations, but cannot take advantage of known mass spectral behavior without pretraining of the algorithm, a disadvantage similar to the interpretive programs coding fragmentation rules (20). The Self-Training Interpretive and Retrieval System (STIRS) does not require pretraining; it has been optimized for such feature recognition by defining 17 classes of mass spectral data particularly suited for the identification of different types of substructures. The data of the unknown mass spectrum corresponding to each of the pre-selected classes is matched against the corresponding data for all compounds in the reference file; if a significant proportion of the 15 best-matching compounds contain a specific structural feature, this is indicative of the presence of that structural feature in the unknown (7). In a recent study (8) a list of 589 substructures was selected on the basis of STIRS performance using 899 random unknowns; knowing this performance, the number of the 15 best-matching compounds containing each substructure can be used to calculate the probability that it is present in the unknown. Table I shows an example of such results. A modification of the STIRS algorithm (9) makes possible the prediction of the molecular weight from an unknown mass spectrum with 91% reliability (95% for the first and second choices). This program has been extended recently to predict the elemental composition (10) with somewhat lower accuracy. An example of the information supplied by STIRS is given in Table I.
Generation of Possible Molecular Structures. The structural information derived from the spectral data can then be used to define all possible combinations which give logical molecular structures. By far the most comprehensive program available for this purpose is CONGEN, developed by the Stanford group (12, 13).
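Before turning to the inputs CONGEN requires, the STIRS voting step described above (counting how many of the best-matching reference compounds contain a given substructure) can be sketched for a single data class. This is a hedged illustration, not the STIRS implementation: the function name, the `min_fraction` threshold, the match scores, and the substructure labels are all hypothetical, and the sketch reports a raw fraction rather than the calibrated probability derived in the study cited as (8). The toy call uses k = 4 instead of 15 to keep the data short.

```python
from collections import Counter


def vote_substructures(matches, k=15, min_fraction=0.5):
    """Return substructures common among the k best-matching references.

    matches: list of (match_score, substructures) pairs for ONE data class,
             where substructures is an iterable of labels for that reference.
    min_fraction: hypothetical cutoff for "a significant proportion"; the
             source text does not state a numerical threshold.
    """
    top = sorted(matches, key=lambda m: m[0], reverse=True)[:k]
    if not top:
        return {}
    counts = Counter()
    for _, substructures in top:
        # Count each substructure at most once per reference compound.
        counts.update(set(substructures))
    return {
        label: n / len(top)
        for label, n in counts.items()
        if n / len(top) >= min_fraction
    }


# Invented match results for one data class (scores and labels are toy data).
MATCHES = [
    (0.9, {"phenyl", "carbonyl"}),
    (0.8, {"phenyl"}),
    (0.7, {"phenyl", "hydroxyl"}),
    (0.6, {"carbonyl"}),
    (0.2, {"hydroxyl"}),
    (0.1, {"hydroxyl"}),
]

VOTES = vote_substructures(MATCHES, k=4)
print(VOTES)
```

In a fuller treatment the same vote would be repeated for each data class, and the counts converted to probabilities using the measured performance on known compounds, as the text describes.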
CONGEN is given the elemental composition of the unknown and lists of structural features which are, and are not, present; from these it generates all chemically-logical molecular structures consistent with these restrictions. Note that the program assumes that all these structures are correct, while the information from mass spectra is usually of