Article pubs.acs.org/ac

Use of an Artificial Immune System Derived Method for the Charge State Assignment of Small-Molecule Mass Spectra

David P. A. Kilgour,*,† C. Logan Mackay,‡ Patrick R. R. Langridge-Smith,‡ and Peter B. O’Connor†

†Department of Chemistry, University of Warwick, Coventry, CV4 7AL, U.K.
‡SIRCAMS, School of Chemistry, University of Edinburgh, Edinburgh, EH9 3JJ, U.K.



ABSTRACT: Knowing the charge state of an ion in a mass spectrum is crucial to being able to assign a formula to it. For many small-molecule peaks in complex mass spectra, the intensities of the isotopic peaks are too low to allow the charge state to be calculated from isotopic spacings, which is the basis of the conventional method of determining the charge state of an ion. A novel artificial intelligence derived method for identifying the charge state of ions, in the absence of any isotopic information or a series of charge states, has been developed using an artificial immune system approach. This technique has been tested against synthetic and real data sets and has proven successful in identifying the majority of multiply charged ions, thereby significantly improving the peak assignment rate and confidence.

Knowing the charge state of peaks in mass spectra is crucial evidence for assigning a proposed formula to an ion, as the formula must be matched against the ion mass and not the mass-to-charge ratio. Conventionally, the charge state of the ion would be identified from the mass difference between the monoisotopologue peak and a stable isotope peak of the same ion. However, for many, indeed the majority, of small-molecule ions in complex mass spectra, the isotopic peaks can be below the signal-to-noise threshold and, therefore, may be impossible to identify reliably. This can greatly reduce the number of ions to which formulas can be assigned, particularly in positive mode spectra, where higher charge states are more common and one can no longer be certain of the charge state. In a previous publication,1 we proposed some novel metrics of confidence which could be applied to improve the reliability and repeatability of automatic peak assignment algorithms in complex mass spectra. We have found that one of those metrics, the “stimulation level” metric, which was derived from the concepts used for a very basic form of artificial immune system, can be adapted to allow the charge state of an ion to be inferred.

Complex Samples and Homologous Series. A common feature of many classes of complex, naturally occurring samples is that the processes responsible for the generation of the sample result in a complex network of interconnected homologous series. For example, it is widely known that many of the compounds in crude oil can be put into homologous series where each member differs from the next by the addition of CH₂. Other homologous series which would interlink with the CH₂ series could be those which result from the sequential loss of H₂ (resulting from the formation of double bonds) or the addition of oxygen (potentially as a result of oxidative processes).

Statistical analysis of the mass differences between peaks in a complex mass spectrum can be used to identify the base of the most common homologous series (e.g., CH₂, H₂, and O in the examples above) in the sample.1−3 This information can be used to generate an inference network, connecting those peaks in the mass spectrum which differ in mass by one of the base units of these identified homologous series. If one member of a homologous series can then be identified, for example by library matching, the formulas of the others can be inferred through the inference network.

The algorithm we have developed for this task has been described previously.1 In summary, the detected peaks from a mass spectrum are searched against a library suitable for the analysis of that sample type. Those peaks which can be assigned elemental formulas on the basis of a unique hit from a library search are placed in the category “known”. In parallel, all the peak data is plotted in a mapping space, at coordinates corresponding to the Kendrick mass defect (KMD) of that mass against two different Kendrick bases, selected to be suitable for the visualization of that sample type. Once the data is plotted in this way, the mapping vectors between all pairs of points in this mapping space can be statistically analyzed to find common transformation vectors between points, in a manner similar to that of Kunenkov et al.3 A subset of the most common of these (perhaps the top 200−500) is converted back to the underlying mass differences, and these mass differences are searched against a library of formulaic differences. Pairs of points in the two-dimensional (2D) KMD mapping space that differ by common transformation vectors which have been assigned to specific formulaic differences are connected into an “inference network”. This network of connections is used to infer the formulas of unassigned peaks from those which have been assigned.

© 2012 American Chemical Society
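As a rough illustration of the mapping step described in the text, the following Python sketch computes Kendrick mass defects against two bases and tallies the most frequent vectors between pairs of points. The choice of CH₂ and O as the two bases, the rounding-based binning, and all function names are assumptions made for this example and are not taken from the published algorithm. Wrap-around of the mass defect at integer boundaries is ignored for simplicity.

```python
from collections import Counter

# (nominal mass, monoisotopic mass in u) of two illustrative Kendrick bases.
# The paper does not state which bases were used; CH2 and O are assumptions.
BASES = {"CH2": (14, 14.01565006), "O": (16, 15.99491462)}


def kmd(mass, base):
    """Kendrick mass defect of `mass` against `base`.

    Uses the convention KMD = nominal Kendrick mass - Kendrick mass;
    some authors use the opposite sign.
    """
    nominal, exact = BASES[base]
    kendrick_mass = mass * nominal / exact
    return round(kendrick_mass) - kendrick_mass


def kmd_point(mass, bases=("CH2", "O")):
    """Coordinates of a peak in the 2D KMD mapping space."""
    return tuple(kmd(mass, b) for b in bases)


def common_vectors(masses, decimals=4, top=5):
    """Most frequent mapping vectors between all pairs of points.

    Vectors are binned by rounding to `decimals` places, which is a crude
    stand-in for a proper statistical analysis.
    """
    points = [kmd_point(m) for m in sorted(masses)]
    counts = Counter()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            vector = tuple(
                round(points[j][k] - points[i][k], decimals) for k in range(2)
            )
            counts[vector] += 1
    return counts.most_common(top)
```

Members of a CH₂ homologous series share the same mass defect on the CH₂ base, so the vectors between neighbouring members have a first component of zero and recur throughout the series.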

Received: May 18, 2012 Accepted: August 7, 2012 Published: August 7, 2012

dx.doi.org/10.1021/ac3013576 | Anal. Chem. 2012, 84, 7436−7439

Analytical Chemistry

Article

However, the inference network can also be used in a manner similar to an artificial immune system.

Artificial Immune Systems and Stimulation Level. Artificial immune systems are a class of artificial intelligence methods which have proven useful in many areas, including data mining, network security, machine vision, pattern matching, bioinformatics, and anomaly detection.4−7 In the same way that artificial neural networks mimic some of the processes which may be important in the processing abilities of the brain, artificial immune systems mimic some of the control mechanisms thought to be important in the distributed intelligence of the mammalian immune system.7−10 A key part of the control mechanism in the immune system is provided by the B cells and specifically the degree to which they are stimulated.10 In part, B cell stimulation is moderated by the closeness of the match between the antigens expressed by that B cell and the pathogens which it encounters. The immune system uses that stimulation level as a trigger for a clonal selection process; insufficiently stimulated cells are culled from the immune system (by apoptosis), whereas B cells which are stimulated above a threshold are triggered to multiply and mutate, to better detect the pathogens.

We previously proposed this concept as a method to derive a confidence metric for the assignment of formulas to mass spectral peaks. Taking each peak in turn, we treat it as a B cell and all other peaks in the spectrum as potential pathogens. The B cell can detect a pathogen if it is directly connected to it through the inference network. The presence of that connection depends on the accuracy of the mapping vector between those two peaks (how closely the mapping vector matches the ideal vector for that formulaic difference); the accuracy acceptance threshold (analogous to the network affinity threshold10 in a conventional artificial immune system) can be set automatically or manually by the user. The total stimulation of a B cell is the sum of all the connections that cell has to potential pathogens. In this way, we can record the stimulation level of all peaks in the spectrum.

Detecting Multiply Charged Ions. In order to attempt to assign a charge state to all ions in the spectrum, we make two assumptions. First, as these samples comprise small molecules, singly charged ions will be predominant in the mass spectrum; second, multiply charged ions will be related to each other by the same mass (not mass-to-charge) differences/defects which exist for singly charged ions. Therefore, the statistically dominant formulaic differences, which were used to identify the bases for the homologous series (pairs or series of chemicals whose formulas are separated by the same relative formulaic change) in the initial analysis, are likely to be those for singly charged ions, but these can also be assumed to apply to the series of multiply charged ions. We therefore make duplicates of the inference network, in which the mapping vectors are remapped to correspond to the new charge state (the observed mass defect differences scale inversely with charge, because ions are measured by their mass-to-charge ratio and not their mass alone).11 The algorithm then searches for all ions which can be connected using the new, higher charge inference networks and determines the stimulation of all peaks based on these new potential connections. The charge state of an ion can be estimated by identifying which of the inference networks resulted in the highest stimulation for that ion, i.e., identifying the charge state as that which allows the ion to be connected to the largest number of other ions by known and assigned mapping vectors.

For example, consider two closely spaced ions taken from a genuine, positive ion mode mass spectrum of a malt whisky, at m/z 340.102043 and 340.108404. The heavier of these peaks is connected to 97 other peaks in the spectrum through the standard inference network built from the mass defect differences of the 176 most significant homologous series in the sample. The stimulation level of the same peak drops to 21 if the mapping vectors used to generate the inference network are halved in length, to produce the spacings that would be found for doubly charged ions; therefore, we assign a charge state of +1 to that peak, as it is more stimulated by the +1 inference network. Conversely, the lighter of these two example peaks is connected to 23 other peaks through the standard inference network, but this stimulation level increases to 75 for the doubly charged inference network; therefore, we assign a charge state of +2 to that ion. The formulas eventually assigned to these peaks, after charge state assignment, are [¹²C₁₃¹³C₁H₂₀O₈·Na]⁺ for the peak at m/z 340.108404 and [C₃₁H₃₈O₁₄·2Na]²⁺ for the peak at m/z 340.102043.
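The stimulation-based charge assignment described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the published implementation: it works directly on m/z differences rather than on vectors in 2D KMD space, it counts any peak within a fixed tolerance as a connection without checking that the linked peak has been assigned, and it breaks ties in favour of the lowest charge (in keeping with the assumption that singly charged ions dominate). The function names, the `tol` parameter, and its default value are hypothetical.

```python
from bisect import bisect_left, bisect_right


def stimulation(peaks, index, diffs, charge, tol=1e-4):
    """Count peaks connected to peaks[index] in the inference network
    for `charge`.

    `peaks` must be a sorted list of m/z values. `diffs` holds the mass
    differences of the homologous series (as found for singly charged
    ions); for charge z the expected m/z spacing is diff / z.
    """
    mz = peaks[index]
    count = 0
    for diff in diffs:
        for target in (mz - diff / charge, mz + diff / charge):
            lo = bisect_left(peaks, target - tol)
            hi = bisect_right(peaks, target + tol)
            count += hi - lo
    return count


def assign_charges(peaks, diffs, charges=(1, 2), tol=1e-4):
    """Map each m/z to (most stimulated charge, {charge: stimulation})."""
    peaks = sorted(peaks)
    result = {}
    for i, mz in enumerate(peaks):
        levels = {z: stimulation(peaks, i, diffs, z, tol) for z in charges}
        # max() returns the first maximum, so ties go to the lowest charge.
        best = max(charges, key=lambda z: levels[z])
        result[mz] = (best, levels)
    return result
```

In the whisky example in the text, the peak at m/z 340.108404 would correspond to stimulation levels of 97 for +1 and 21 for +2, and so would be assigned +1 by this rule.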



RESULTS

This concept has been tested against both synthetic and real mass spectral data. A synthetic data set was generated using the first 65 fulvic acid homologous series presented in the supplementary data by Stenson et al.,12 which corresponds to a total of 1051 ions once all the homologous series have been expanded. An artificial mass spectrum was generated containing peaks from both singly and doubly charged versions of those fulvic acids, the singly charged ions being monosodiated and the doubly charged ions disodiated. Therefore, there are a total of 2102 ions in the synthetic mass spectrum, and there are no isotope peaks. The data set, plotted in 2D KMD space,1 is shown in Figure 1.
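A synthetic spectrum of this kind could be generated along the following lines. The sketch assumes neutral formulas containing only C, H, and O and uses standard monoisotopic masses; the function names are hypothetical, and the formulas from Stenson et al. are not reproduced here. For illustration it uses the formula C31H38O14, which appears elsewhere in this article as a disodiated ion.

```python
import re

# Monoisotopic masses in u, and the electron mass.
MONO = {"C": 12.0, "H": 1.00782503, "O": 15.99491462, "Na": 22.98976928}
ELECTRON = 0.00054858


def neutral_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C31H38O14'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONO[element] * (int(count) if count else 1)
    return total


def sodiated_mz(formula, n_sodium):
    """m/z of [M + n Na]^(n+): each Na+ adds one sodium minus one electron."""
    ion_mass = neutral_mass(formula) + n_sodium * (MONO["Na"] - ELECTRON)
    return ion_mass / n_sodium


def synthetic_spectrum(formulas):
    """Sorted m/z list holding a +1 and a +2 sodiated ion per formula."""
    return sorted(sodiated_mz(f, n) for f in formulas for n in (1, 2))


# Example: the disodiated ion of C31H38O14 lies close to m/z 340.102.
print(round(sodiated_mz("C31H38O14", 2), 4))
```

Each neutral formula contributes two peaks, which is how 1051 formulas give 2102 ions in the synthetic spectrum.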

Figure 1. Synthetic fulvic acids derived data set, containing both singly (solid squares) and doubly (hollow squares) charged ions, plotted in 2D KMD mapping space.

Prior to charge state assignment, the algorithm described previously assumes all ions are singly charged and returns unique formulas for 1174 of the 2102 ions, with 928 ions remaining unassigned. All the singly charged ions have been correctly identified, but 123 doubly charged ions have been misassigned with singly charged formulas, because the doubly charged, disodiated ion is sufficiently close in mass to a monosodiated ion candidate, in this case,

algorithm returns an assignment rate of 74% (2164 peaks out of 2516 peaks being assigned unique formulas) with a mass accuracy requirement of 200 ppb, a uniqueness threshold1 of 400 ppb, and the requirement that the inference network be 100% internally consistent. This assignment rate is some 5−10% lower than has been consistently recorded for negative mode mass spectra of malt whiskies using the same algorithm, which was the indicator that prompted the investigation into the presence of doubly charged ions. As a result of charge state assignment, 2203 peaks are most stimulated in (and hence assigned to) charge state +1 and 313 in charge state +2. Using the same formula assignment setup as for the undeconvolved spectrum, the assignment rate for the +1 peaks rises to 84%, but the assignment rate of the +2 peaks is only 33%. Further investigation, undertaken by adjusting the mass accuracy and uniqueness range of the assignment algorithm, reveals that the poor assignment rate of the deconvolved +2 peaks results from a systematic mass calibration error, as shown in Figure 3.