Factor Analysis of the Mass Spectra of Oligodeoxyribonucleotides

Sir: Factor analysis (1) is a multivariate analysis technique that can be used to identify similar patterns of variation in data variables. Several applications of factor analytic techniques to the interpretation of mass spectral data have recently appeared in the chemical literature (2-5). The purpose of those studies was to determine which approaches would be most useful for studying mass spectral data and to apply these approaches to low resolution mass spectral data of organic compounds. Rozett and Petersen (2-4) have studied the low resolution mass spectra of the 22 isomers of molecular formula C10H14. They have identified factors related to the structures of the isomers and have used these factors to classify the isomers according to structural similarities. Justice and Isenhour (5) have studied 630 low resolution mass spectra of compounds with composition C(2-10)H(2-22)O(0-4)N(0-2) and have identified factors related to the seven functional groups represented in the data set.

We have used factor analytic techniques to study the low resolution mass spectra obtained from intact underivatized oligodeoxyribonucleotides. The purpose of these studies was to identify structural information contained in the mass spectra of these compounds (Table I) that would be useful for the analysis of unknown compounds. That is, we would like to obtain information related to the presence or absence of a given nucleoside, the relative numbers of the different nucleosides, or the sequence of the nucleosides in a compound. Details of the data collection process and a description of the computer facilities used in these studies are given elsewhere (6, 7). The intensities of 32 ions from the mass spectra were used. These ions were selected based on previous mass spectral studies of nucleosides and nucleotides (6, 8). The ions selected to be representative of the presence of each nucleoside are listed in Table II.
Since no calibration factor was obtained with the mass spectra and since the absolute intensities of the ions varied considerably from compound to compound, two relative measures of the ion intensities were used: normalization of the individual ion intensities to the sum of the intensities for selected subsets of the ions, and ratios of ion intensities.

Identification of Nucleosides Present. To determine the presence or absence of a particular nucleoside in a compound, the normalized-to-sum intensities of the ions representative of that nucleoside were used. A separate Q-type (1, 2) principal components analysis, using covariance about the origin as the dispersion matrix, was used to establish the presence or absence of each nucleoside in a compound. Each principal components analysis included all the oligonucleotides listed in Table I. For each analysis, only the eight ions (Table II) derived from the nucleoside whose presence or absence was to be established were used. Figure 1 demonstrates the results obtained for the analysis of the adenosine ions. The first factor represents the presence of adenosine in a compound, as indicated by the high loadings for all the compounds that contain adenosine. The second factor represents the absence of adenosine in a compound, as indicated by the high loadings for the compounds that do not contain adenosine. Similar results were obtained from the analyses of the normalized-to-sum ions for the other three nucleosides.

A Q-type principal components analysis was also performed using all 32 ions (Table II) normalized to their sum. Again, all the oligonucleotides were used and covariance about the origin was used as the dispersion matrix. Twenty factors were found, with the first four accounting for approximately 99%
ANALYTICAL CHEMISTRY, VOL. 49, NO. 9, AUGUST 1977
Table I. Compounds Used in These Studies (a)
(a) A represents adenosine, C represents cytidine, T represents thymidine, G represents guanosine, and p represents the 3’ to 5’ phosphate linkage between adjacent nucleosides.
Table II. Selected Ions (m/e) Listed According to Nucleoside
Adenosine   Guanosine   Thymidine   Cytidine
108         109         126         122
119         151         181         125
120         176         206         162
135         188         225         191
172         231         243         192
186         268         257         228
252         282         258         242
295         311         286         271
of the variance. Rotation of these four factors according to the varimax criterion (1, 2) indicated that each factor was representative of one of the four nucleosides. However, the loadings on each factor could not be used to reliably establish the presence or absence of the nucleosides represented by each factor in compounds that contained more than one type of nucleoside. The ability of the loadings on each factor to establish the presence or absence of the nucleosides generally decreased as the number of different types of nucleosides in a compound increased. This was because, when the ions for all the nucleosides were used, the normalized intensities of the ions for each nucleoside depended on the intensities of the ions for all the nucleosides in a compound.

Determination of Nucleoside Ratios. Information as to the relative numbers of the different nucleosides in a compound should be contained in the ion intensities for each nucleoside. The best features to describe this information are ratios of the ion intensities for different nucleosides. Analyses using ratio features were performed to determine if this information was contained in the mass spectra. All the compounds in our data set that contained both adenosine and thymidine and all the compounds that contained both cytidine and guanosine had the same relative numbers of the two nucleosides. The compounds that contained the other pairs of nucleosides (adenosine-cytidine, thymidine-cytidine, adenosine-guanosine, and thymidine-guanosine) had different relative numbers of the two nucleosides and were used in these analyses. Feature selection was performed by an R-type (1, 2) analysis, using covariance about the origin as the dispersion matrix, of the normalized-to-sum ions for each nucleoside for all compounds that did not contain that nucleoside. Those ions were retained that accounted for less than 1% of the total variance.
Thus, those ions represented the nucleoside of interest with a minimal contribution from the other nucleosides. Ratios of the selected ions were formed for each pair of nucleosides, with the ions for a given nucleoside always in either the numerator or the denominator to keep the “sense” of the ratios the same. The ratios are listed in Table III.
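The feature selection and ratio formation described above can be sketched roughly as follows. This is a simplified illustration in Python, not the procedure actually used in these studies: it assumes that an ion's share of the total variance about the origin can be approximated by its share of the sum of squares (the diagonal of the uncentered dispersion matrix), and the function names `low_variance_ions` and `ion_ratios` are hypothetical.

```python
import numpy as np

def low_variance_ions(X, mz, threshold=0.01):
    """Return the m/e values whose share of the total variance about the
    origin is below `threshold`.

    X  : (compounds x ions) normalized-to-sum intensities for compounds
         that do NOT contain the nucleoside of interest.
    mz : m/e label for each column of X.
    Assumption: the per-ion share is taken from the diagonal of X^T X.
    """
    X = np.asarray(X, dtype=float)
    per_ion = (X ** 2).sum(axis=0)      # uncentered sum of squares per ion
    share = per_ion / per_ion.sum()     # fraction of the total
    return [m for m, s in zip(mz, share) if s < threshold]

def ion_ratios(intensity, numerator_mz, denominator_mz):
    """Form every numerator/denominator ratio, keeping one nucleoside's
    ions always in the numerator so that all ratios have the same sense.

    intensity : dict mapping m/e -> measured intensity for one compound.
    """
    return {f"{n}/{d}": intensity[n] / intensity[d]
            for n in numerator_mz for d in denominator_mz}
```

For an adenosine-cytidine compound, for example, `ion_ratios(spectrum, [172, 186, 252, 295], [191, 192, 271])` would produce the twelve ratio features listed for that pair, where `spectrum` is a hypothetical dictionary of measured intensities.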
Figure 1. Two-dimensional plot of the loadings on the first two factors resulting from Q-analysis of the normalized-to-sum adenosine ions for compounds containing adenosine (class 1) and compounds not containing adenosine (class 0)
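A Q-type principal components analysis of the kind whose loadings are plotted in Figure 1 can be sketched as below. This is a minimal illustration under stated assumptions: the dispersion matrix is taken as the uncentered cross-product matrix among compounds (covariance about the origin), loadings are eigenvectors scaled by the square roots of their eigenvalues, and the function names are hypothetical. The software actually used is described in refs 6 and 7.

```python
import numpy as np

def normalize_to_sum(X):
    """Divide each compound's ion intensities by their sum (row-wise)."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

def q_type_pca(X):
    """Q-type principal components analysis with covariance about the origin.

    X : (compounds x ions) matrix of relative intensities.
    Returns (loadings, variance_fraction); loadings has one row per
    compound and one column per factor, ordered by decreasing variance.
    """
    X = np.asarray(X, dtype=float)
    C = X @ X.T / X.shape[1]              # no mean subtraction
    evals, evecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]       # reorder to descending
    evals = np.clip(evals[order], 0.0, None)   # remove tiny negative round-off
    evecs = evecs[:, order]
    loadings = evecs * np.sqrt(evals)     # scale each factor by sqrt(eigenvalue)
    return loadings, evals / evals.sum()
```

Plotting the first column of `loadings` against the second gives a two-dimensional display like that of Figure 1, and the cumulative sum of `variance_fraction` shows how many factors are needed to account for a given share of the variance.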
Table III. Ratios of Ions Selected by R-analysis
Nucleoside pair: ratios of ions (m/e)
Adenosine-cytidine: 172/191, 172/192, 172/271, 186/191, 186/192, 186/271, 252/191, 252/192, 252/271, 295/191, 295/192, 295/271
Thymidine-cytidine: 225/191, 225/192, 225/271, 286/191, 286/192, 286/271
Adenosine-guanosine: 172/188, 172/231, 172/268, 172/282, 186/188, 186/231, 186/268, 186/282, 252/188, 252/231, 252/268, 252/282, 295/188, 295/231, 295/268, 295/282
Thymidine-guanosine: 225/188, 225/231, 225/268, 225/282, 286/188, 286/231, 286/268, 286/282
Q-type analyses of the compounds containing a given pair of nucleosides, using covariance about the origin for the dispersion matrix, were performed for the ratios listed in Table III. The relative loadings for the different compounds on the principal factor resulting from each analysis are listed in Table IV according to the nucleoside pairs. The loadings for the compounds containing both adenosine and cytidine and for the compounds containing both thymidine and cytidine give a semiquantitative estimate of the relative numbers of the two nucleosides present in a compound. The results for the compounds containing guanosine were not as good. A possible reason for this is discussed below. Other feature selection methods were applied in an attempt to identify ratios that would yield better results. No other feature combinations were found that yielded better results.

Sequence Analysis. Pattern recognition techniques have been successfully applied to the sequence analysis of dinucleotide isomers (6, 7). The features found useful for the sequence analysis of the dinucleotides were not useful for the analysis of longer chain compounds. This was not surprising, since the spectra of the oligonucleotides were much more complex because of the presence of more than two types of nucleosides and multiple residues of a given nucleoside. The factor analyses of the normalized-to-sum ions for each nucleoside indicated that any variations in the fragmentation patterns of the nucleosides that were related to the position of the nucleoside in the compound were smaller than the experimental variations in the data and could not be detected by factor analysis of the normalized-to-sum ions for each nucleoside. Thus, we were not able to obtain sequence information for compounds larger than dinucleotides.
CONCLUSIONS

The factor analysis studies reported here have produced promising results. Although the presence or absence of a nucleoside can be determined by the presence or absence of a few of the most intense ions for that nucleoside, the capability of factor analysis for reliable computerized determination of the presence or absence of a nucleoside has been demonstrated. The factor analyses have also indicated that information related to the relative numbers of nucleosides in a compound may be present in the mass spectra of oligonucleotides. The fact that only a semiquantitative estimate of the relative numbers of nucleosides in a compound was obtained was probably due to the limited resolution of the data collection process. Visual inspection of the spectra for those compounds that deviated the most from the expected relationships indicated that these spectra were generally less intense than those obtained for the compounds that showed the expected relationships. This was most noticeable for guanosine because it yields the least intense spectra of the four nucleosides. Since adenosine and cytidine yield the most intense spectra, the results of analyses involving these nucleosides came closest to the expected results. While the results reported here cannot be considered conclusive because of the limited data set studied and the limited resolution of the data collection process, we feel that they are sufficiently encouraging to warrant further studies with a larger and better defined data set, and that we have demonstrated a useful approach for the factor analysis of complex data sets.

Table IV. Relative Loadings on the Principal Factors Resulting from Q-analyses with the Ratios Listed in Table III. The Loadings Are Relative to the Average of the Loadings for the Compounds That Have a Nucleoside Ratio of 1

Adenosine-cytidine and thymidine-cytidine [compound names and the assignment of these rows to the two pairs are not recoverable]
Nucleoside ratio:  1     1     1     1     1     1     1     3     1     2     0.5
Relative loading:  0.85  1.1   1.2   0.5   0.75  1.1   0.96  4.0   1.5   3.1   0.66
Relative loading:  0.95  0.64  0.56  1.3   1.5   5.0   1.5   3.4   0.79

Thymidine-guanosine
Compound        Nucleoside ratio (T/G)   Relative loading
pGpT            1                        1.8
pTpG            1                        0.93
pTpCpG          1                        0.44
ApTpGpC         1                        0.86
pApTpGpCpApT    2                        1.6
pCpGpApTpGpC    0.5                      0.44

Adenosine-guanosine
Compound        Nucleoside ratio (A/G)   Relative loading
pApG            1                        0.99
pGpA            1                        1.6
pApGpA          2                        0.35
(pApGpA)2       2                        0.88
GpCpA           1                        0.43
ApTpGpC         1                        0.94
pApTpGpCpApT    2                        1.2
pCpGpApTpGpC    0.5                      0.35
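The scaling used for Table IV (loadings expressed relative to the average loading of the compounds whose nucleoside ratio is 1) can be written out as a short Python sketch. The function name `relative_loadings` is hypothetical. The numbers in the check are the thymidine-guanosine relative loadings and nucleoside ratios reported in Table IV, whose ratio-1 entries should average close to 1 by construction.

```python
def relative_loadings(loadings, ratios):
    """Scale loadings by the mean loading of the compounds with ratio 1."""
    reference = [l for l, r in zip(loadings, ratios) if r == 1]
    mean_reference = sum(reference) / len(reference)
    return [l / mean_reference for l in loadings]

# Thymidine-guanosine entries as reported in Table IV
tg_loadings = [1.8, 0.93, 0.44, 0.86, 1.6, 0.44]
tg_ratios = [1, 1, 1, 1, 2, 0.5]

# Mean of the ratio-1 entries; expected to be near 1 for a table that
# has already been scaled this way.
ratio_one = [l for l, r in zip(tg_loadings, tg_ratios) if r == 1]
mean_ratio_one = sum(ratio_one) / len(ratio_one)
```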
LITERATURE CITED
(1) R. J. Rummel, “Applied Factor Analysis”, Northwestern University Press, Evanston, Ill., 1970.
(2) R. W. Rozett and E. M. Petersen, Anal. Chem., 47, 1301 (1975).
(3) R. W. Rozett and E. M. Petersen, Anal. Chem., 47, 2377 (1975).
(4) R. W. Rozett and E. M. Petersen, Anal. Chem., 48, 817 (1976).
(5) J. B. Justice, Jr., and T. L. Isenhour, Anal. Chem., 47, 2286 (1975).
(6) J. L. Wiebers and J. A. Shapiro, Biochemistry, 16, 1044 (1977).
(7) D. R. Burgard, S. P. Perone, and J. L. Wiebers, Biochemistry, 16, 1051 (1977).
(8) J. A. McCloskey in “Basic Principles in Nucleic Acid Chemistry”, Vol. I, P.O.P. Ts’o, Ed., Academic Press, New York, N.Y., 1974, Chapter 3.
D. R. Burgard
S. P. Perone*
Department of Chemistry
Purdue University
West Lafayette, Indiana 47907

J. L. Wiebers
Department of Biological Sciences
Purdue University
West Lafayette, Indiana 47907

RECEIVED for review April 25, 1977. Accepted June 6, 1977. This work supported by NSF grant No. MPS74-12762, the Office of Naval Research, and NSF grant No. PCM76-21554.
Atomic Spectrochemical Measurements with a Fourier Transform Spectrometer

Sir: In a preliminary report, we discussed the potential application of Fourier transform spectrochemical instrumentation to atomic spectral measurements (1). Since that time we have further developed our experimental system in order to improve our measurement capability in the visible and ultraviolet spectral regions.

The Michelson interferometer that we have designed and built for application to Fourier transform spectroscopy has three optical inputs: a He-Ne laser, a white light (tungsten bulb) source, and the spectral signal of interest. Laser fringe referencing is used to sequence digitization and to control the velocity of the moving mirror using a phase-locked loop. The mirror drive system consists of an electromechanically driven mirror supported by an air bearing. The mirror movement is repetitive and both the scan rate and length can easily be set or altered. With control signals derived from the white light interferogram and the laser fringes, signal interferograms can be precisely time averaged. This system uses a unique “pretrigger” approach that allows acquisition of any desired number of signal interferogram data points both before and after the zero path difference position. We did not have this last feature when the data were acquired for Ref. 1. This meant that only “one-sided” interferograms could be acquired, with no points acquired before the central fringe of the signal interferogram, which made it essentially impossible to reliably phase correct spectra. With our present ability to acquire full double-sided interferograms, i.e., an equal number of points on either side of the zero path difference position, excellent spectra can be calculated without using any phase correction procedures, i.e., by calculating the amplitude spectrum. As will be seen, this makes a considerable improvement in the quality of the spectra that can be measured with our system.
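The reason an amplitude spectrum needs no phase correction can be illustrated with a short numerical sketch. The Python example below is only an illustration under assumed values: a single spectral line is simulated as a cosine interferogram with 256 points on each side of zero path difference, a constant phase error is added, and the magnitude of the complex transform is taken. The line position (`line_bin`) and the phase error are hypothetical, and NumPy's FFT stands in for the transform program actually used.

```python
import numpy as np

def amplitude_spectrum(interferogram):
    """Magnitude of the complex Fourier transform of an interferogram.

    Taking the magnitude discards the phase, so a constant phase error
    in the interferogram does not change the result.
    """
    y = np.asarray(interferogram, dtype=float)
    y = y - y.mean()                      # remove the DC offset
    return np.abs(np.fft.rfft(y))

n = 512                                   # double-sided interferogram length
k = np.arange(-n // 2, n // 2)            # 256 points each side of zero path difference
line_bin = 100                            # hypothetical line position (transform bin)
phase_error = 0.8                         # hypothetical phase error, radians

ideal = np.cos(2 * np.pi * line_bin * k / n)
shifted = np.cos(2 * np.pi * line_bin * k / n + phase_error)

spec_ideal = amplitude_spectrum(ideal)
spec_shifted = amplitude_spectrum(shifted)
peak = int(np.argmax(spec_shifted))       # bin of the recovered line
```

In this sketch the line is recovered at the same bin and with the same height whether or not the phase error is present, which is the property exploited when double-sided interferograms are processed as amplitude spectra.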
In addition, the interferometer is now interfaced to a larger computer (PDP 11/10) which is presently capable of 4k transforms (floating point FORTRAN). This has allowed us to more fully use the designed resolution capability of the interferometer. Our previous maximum transform size was 512 points. The flame (air-C2H2) emission spectrum of Li, K, Rb, and Cs as measured using our Fourier transform spectrometer with
four different sampling rates is shown in Figure 1. A Si photodiode was used as the detector. These spectra were calculated from 512-point double-sided interferograms (i.e., 256 points each side of zero path difference) and 50 scans were taken of each interferogram. The solution used to obtain the spectra shown in Figure 1 contained about 250 ppm of each element. In addition to illustrating the marked improvement in spectrum quality arising from the acquisition and processing of double-sided interferograms, this series of spectra also provides an excellent illustration of the utility of aliasing in providing optimized resolution and spectral coverage with a limited transform size. Since we use the standard He-Ne reference laser, the basic sampling interval for the interferometer system is 0.6328 µm. This means that the shortest wavelength of light that can be properly sampled without aliasing is 1.266 µm (7901 cm⁻¹). In order to measure the major emission lines of Cs, Rb, K, and Li in the near-IR (see Table IV of Ref. 1) without aliasing, a bandwidth of 15 803 cm⁻¹ is necessary. This can be achieved with our measurement system by frequency doubling the laser fringe signal using phase-locked loop techniques (2) to provide a sampling interval of 0.3164 µm. However, with a 512-point transform limit the potassium and rubidium doublets cannot be resolved. This is shown in Figure 1a. The numbers (6, 7, 8) shown below the spectral peaks refer to spectral regions associated with the i-4 sampling rate (see Table III of Ref. 1). The arrows indicate the direction of increasing wavenumbers. Using a 0.6328-µm sampling interval (Figure 1b), resolution is doubled and the spectral information is aliased.
With a sampling interval of 1.266 µm (Figure 1c), the potassium doublet is resolved and the cesium lines have aliased into the potassium-rubidium region. In the last spectrum (Figure 1d), measured with the 2.532-µm sampling interval, all lines are well resolved and all three regions overlap via aliasing. It is clear from this series of spectra that aliasing can be used to advantage in optimizing spectral coverage and resolution when making Fourier transform spectrochemical measurements.
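The relation between sampling interval, unaliased bandwidth, and aliasing region used in this discussion can be sketched numerically. The Python functions below are an illustration only, with hypothetical names. They assume that the sampling interval is expressed as optical path difference in micrometers, that the unaliased bandwidth is 1/(2 × interval), that regions are numbered from 1 starting at zero wavenumber, and that even-numbered regions fold back with the wavenumber axis reversed.

```python
def nyquist_bandwidth_cm1(sampling_interval_um):
    """Unaliased bandwidth in cm^-1 for a sampling interval in micrometers."""
    return 1.0e4 / (2.0 * sampling_interval_um)

def alias_region(wavenumber_cm1, sampling_interval_um):
    """1-based index of the bandwidth-wide region containing the line."""
    b = nyquist_bandwidth_cm1(sampling_interval_um)
    return int(wavenumber_cm1 // b) + 1

def apparent_wavenumber(wavenumber_cm1, sampling_interval_um):
    """Position at which an aliased line appears in the computed spectrum."""
    b = nyquist_bandwidth_cm1(sampling_interval_um)
    region = alias_region(wavenumber_cm1, sampling_interval_um)
    offset = wavenumber_cm1 - (region - 1) * b
    # Odd regions map directly; even regions are folded (axis reversed).
    return offset if region % 2 == 1 else b - offset

def point_spacing_cm1(sampling_interval_um, n_points):
    """Spectral point spacing for an n_points double-sided interferogram,
    which yields n_points / 2 spectral points across the bandwidth."""
    return nyquist_bandwidth_cm1(sampling_interval_um) / (n_points / 2)

# Lithium resonance line near 670.8 nm, converted to cm^-1
li_line_cm1 = 1.0e7 / 670.8
```

With these definitions, `nyquist_bandwidth_cm1(0.6328)` and `nyquist_bandwidth_cm1(0.3164)` reproduce the 7901 and 15 803 cm⁻¹ figures quoted in the text, and `point_spacing_cm1` shows how each doubling of the sampling interval halves the point spacing for a fixed 512-point transform. The folding in `apparent_wavenumber` is the usual reason the wavenumber axis runs in opposite directions in adjacent aliased regions.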