Article pubs.acs.org/ac
Hyperspectral Visualization of Mass Spectrometry Imaging Data
Judith M. Fonville,†,# Claire L. Carter,‡ Luis Pizarro,§,∥ Rory T. Steven,‡ Andrew D. Palmer,‡ Rian L. Griffiths,‡ Patricia F. Lalor,⊥ John C. Lindon,† Jeremy K. Nicholson,† Elaine Holmes,*,† and Josephine Bunch*,‡
† Biomolecular Medicine, Department of Surgery and Cancer, Imperial College London, South Kensington, London SW7 2AZ, United Kingdom
‡ School of Chemistry, University of Birmingham, Edgbaston, Birmingham B15 2TT, United Kingdom
§ Department of Computing, Imperial College London, South Kensington, London SW7 2AZ, United Kingdom
∥ Escuela de Ingeniería Informática, Facultad de Ingeniería, Universidad Diego Portales, Av. Ejército 441, Santiago, Chile
⊥ Centre for Liver Research and NIHR Biomedical Research Unit, School of Immunity and Infection, University of Birmingham, Edgbaston, Birmingham B15 2TT, United Kingdom
Supporting Information
ABSTRACT: The acquisition of localized molecular spectra with mass spectrometry imaging (MSI) has a great, but as yet not fully realized, potential for biomedical diagnostics and research. The methodology generates a series of mass spectra from discrete sample locations, which is often analyzed by visually interpreting specifically selected images of individual masses. We developed an intuitive color-coding scheme based on hyperspectral imaging methods to generate a single overview image of this complex data set. The image color-coding is based on spectral characteristics, such that pixels with similar molecular profiles are displayed with similar colors. This visualization strategy was applied to results of principal component analysis, self-organizing maps and t-distributed stochastic neighbor embedding. Our approach for MSI data analysis, combining automated data processing, modeling and display, is user-friendly and allows both the spatial and molecular information to be visualized intuitively and effectively.
Most biochemical analyses of tissue samples are concerned with measuring global tissue concentrations and disregard molecular distribution. However, spatial molecular information is of paramount importance in biomedical research for understanding pathogenesis and disease progression. A successful approach to obtain localized biomolecular information is mass spectrometry imaging (MSI), acquiring mass spectra for different positions on the sample,1,2 see Figure 1. Methods that are frequently used for characterizing the spatial distribution of biomolecules in tissues include matrix-assisted laser desorption/ionization (MALDI)3 and desorption electrospray ionization.4 Secondary ion mass spectrometry is a well-established methodology for elemental and small molecule analyses that is increasingly used to study biological objects.5,6 MSI data can be interpreted either as a full mass spectrum at a given spatial point (pixel), or as an image of a specific ion’s intensity over a two-dimensional set of pixels. MALDI MSI has been successfully used in protein, lipid, and metabolite profiling studies on a range of tissue types, from animal models as well as human surgical samples.7−11 MALDI spectra reflect both biological variation, typically the property of interest, as well as variation due to experimental and instrumental sources. Sample preparation for MALDI is therefore of critical importance12 and thus remains an area of ongoing research, an example of which is desalting methodology, recently applied in lipid analysis to reduce spectral complexity and the number of isobaric species detected.13,14 Normalization of the spectra can reduce some of these systematic experimental variations, for example, varying levels of signal intensity as a result of inhomogeneous matrix deposition, differential efficiency of analyte incorporation into the matrix crystals, and crystal inhomogeneity across the sample.15−17 Normalization may also correct for differential ionization efficiencies per region but not for other inherent limitations of the MALDI technique and sample preparation protocols. These effects include ionization efficiency differences between molecules, differential ion suppression and adduct formation because of varying sample composition, and varying levels of analyte solubility and extraction from different tissue regions.18 These aspects of MALDI data are intrinsic to the current methodology and cannot be corrected for computationally. However, despite these drawbacks, MALDI MSI is currently the most widely used technique for the spatial molecular analysis of biological samples, demanding the development of data processing and analysis tools to optimally deal with the data, and maximize the information extracted from these massive data sets.19
Received: August 16, 2012 Accepted: December 18, 2012 Published: December 18, 2012
© 2012 American Chemical Society
dx.doi.org/10.1021/ac302330a | Anal. Chem. 2013, 85, 1415−1423
Analytical Chemistry
Article
Figure 1. Mass spectrometry imaging experiment and data analysis. A tissue sample is obtained by cryo-sectioning and mounted on a plate. A matrix solution is applied for matrix-assisted laser desorption/ionization (MALDI), prior to mass spectrometry imaging (MSI) data acquisition. MSI is typically set up to acquire a mass spectrum for each position along a grid, and the resulting data set consists of x by y pixels and n m/z values (spectral data points or bins). Thus, the MSI data are represented by a three-dimensional data array of size x × y × n, and the values in this data array represent the peak intensity. Traditionally, these data are evaluated with univariate methods: either by comparing the mass spectral profiles of pixels in selected regions in the image, or by visualizing each or a selection of the n individual m/z images. These univariate approaches contrast with multivariate methods, where the data set is unfolded into an (x × y) × n table: the two spatial dimensions of the MSI data are combined to form a large table where each row is the mass spectrum at a given (x, y) position and each column is one m/z variable. On this unfolded data set, multivariate data analyses such as principal component analysis (PCA), self-organizing maps (SOM), and t-distributed stochastic neighbor embedding (t-SNE) can be performed, while retaining the information of the (x, y) location of each mass spectrum to reconstruct images.
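The unfolding step described in this caption can be illustrated with a short sketch. It is not the authors' MATLAB code: the function names `unfold` and `refold` and the toy array are hypothetical, and the example assumes the data cube is held as a NumPy array of shape (x, y, n) in row-major order.

```python
import numpy as np

def unfold(cube):
    """Reshape an (x, y, n) MSI cube into an (x*y, n) pixel-by-m/z table."""
    x, y, n = cube.shape
    return cube.reshape(x * y, n)

def refold(values, x, y):
    """Map per-pixel values of shape (x*y, k) back onto the (x, y) grid."""
    return values.reshape(x, y, -1)

# Toy cube (synthetic numbers): 4 x 3 pixels, 5 m/z bins.
cube = np.arange(4 * 3 * 5, dtype=float).reshape(4, 3, 5)
table = unfold(cube)             # one row per pixel, one column per m/z bin
restored = refold(table, 4, 3)   # pixel positions are recovered from row order
```

With this layout, the spectrum of pixel (i, j) is row `i * y + j` of the table, so any per-pixel model output (scores, colors) can be folded back into an image without storing coordinates separately.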
MALDI MSI and related imaging methods are promising molecular profiling techniques, because the generated data sets contain both spectral (compositional) and spatial biological information. However, the potential of MSI is not fully realized by the majority of analyses performed: data are typically interpreted by depicting images of individual mass spectral peak (m/z) intensities or by evaluating mass spectra for selected regions in a profiling manner (Figure 1). The main body of literature is concerned with the localization of predetermined compounds using prior knowledge of tissue and disease biology, which in effect disregards and discards the majority of the data by considering only a single or limited number of m/z images. Thousands of images have to be analyzed if one wants to thoroughly evaluate the data set one m/z value at a time, which is not only time- and labor-intensive, but also highly likely to miss associations between different m/z values, because the multivariate aspect of the data is ignored. To capture more accurately the full complexity of biological tissue and to prevent the overinterpretation of artifacts, it is best to adopt unbiased analytical tools of sufficient sophistication to accommodate such high-dimensional data.20,21
The intrinsic limitations and difficulties of data extraction and visualization for the univariate approaches described above can be overcome by performing multivariate statistical analyses. Typically, the MSI data set is reshaped into a two-way table, where each row corresponds to a different pixel and each column reflects an m/z variable (Figure 1). Various multivariate analysis tools have already been used to investigate the MSI data, including principal component analysis (PCA),21,22 partial least squares analysis,23 support vector machines,7,24 random forests,20 and clustering algorithms.21,25−27 Some of these approaches need data compression by PCA or apply a variable selection step prior to multivariate analysis to reduce the dimensionality of the data, in order to achieve good modeling results and prevent overfitting of the data.21,24−26 However, classification and segmentation methods are not designed to represent a biologically relevant continuum characterized by a range of slightly different molecular profiles. Similarly, linear methods such as the widely used PCA typically do not allow for a visualization of nonlinear and more subtle effects in the data.
We propose a methodology that fully integrates the multivariate and image nature of MSI data without linearity constraints or preselection of “interesting” m/z values or pixels. We apply a nontargeted approach on the complete, processed data set rather than a small selection of sample regions and m/z values, to visualize the data set effectively. Hyperspectral modeling is combined with intelligent color-coding to create an intuitive image display that summarizes the MSI data features. We demonstrate this approach for three data modeling methods: (1) principal component analysis, a linear modeling method frequently employed in MALDI MSI analysis; (2) self-organizing maps,28 a type of neural network; and (3) t-distributed stochastic neighbor embedding,29 a neural network-based manifold learning technique30−32 for hyperspectral data analysis that has not been previously used in the analysis of MSI data. The combination of automated data processing,15 modeling and visualization fully exploits the genuine power of MALDI MSI, by providing a sophisticated and unbiased overview of the data.
Figure 2. RGB color-coding of hyperspectral modeling results. (a) Schematic of the anatomy for the rat brain that was subjected to MALDI MSI after formalin fixation; scale bar = 2 mm. Key: CB = cerebellum; CC = corpus callosum; CTX = cerebral cortex; DCN = deep cerebellar nuclei; F = fornix; HP = hippocampus; HY = hypothalamus; M = medulla; MD = midbrain; OC = optic chiasm; P = pons; PG = pituitary gland; S = septum; TH = thalamus; 3V = third ventricle; 4V = fourth ventricle. (b) Three randomly chosen single m/z images (m/z 791.4, 839.6, and 865.6). (c) An overlay of the three images in b is shown, through combining the individual red, green, and blue intensities for each pixel as additive colors (white pixels consist of high levels of red, green, and blue). (d) PCA space: the location of a pixel (each pixel is represented by a dot) on principal component 1 (PC 1), PC 2 and PC 3 determines the intensity for red, green and blue (RGB), respectively. (e) The pixels contained in the box in the PCA scores plot in d are shown in color. (f) The image after color-coding the pixels with the RGB-scheme shown in d. (g) SOM space: a unique color for each SOM unit is assigned with red, green and blue representing the location along the three dimensions of the 3D SOM map (20 × 10 × 5). (h) The pixels that were mapped in the 3 × 3 × 1 square section of the SOM map highlighted in g can be seen in the original image with the same color-coding. (i) The complete image with SOM-based RGB color-coding. (j) t-SNE space: the scatter plot of pixels in the t-SNE model shows clear clustering patterns, and pixels are RGB color-coded based on their positions on the three axes. (k) The cluster selected with the box in j is shown as colored pixels in the image. (l) The image after color-coding the pixels with RGB values determined by the t-SNE manifold learning method.
Results will be presented for a range of biological samples to illustrate its generic applicability, and we demonstrate how our approach can be routinely applied to existing and legacy data sets, including formalin fixed samples and samples acquired with fast MSI methodology.33−35
■ MATERIALS AND METHODS
Data Sets. For the formalin-fixed rat brain sample, the data acquisition and processing have been previously described;15,36 the experimental and processing details for the consecutive liver
sections of a cirrhotic liver sample from a patient with nonalcoholic steatohepatitis (NASH) are described in the Supporting Information, and an example processed data set (the formalin-fixed rat brain sample) is also included in Supporting Information.
Data Modeling. Manifold learning approaches attempt to embed high-dimensional data in a new, low-dimensional space that is as rich and concise as possible.30 This versatile class of methods is well-suited for data visualization and investigation of nonlinear relations in the data. We used the following three methods to map the data: (1) principal component analysis; (2) the self-organizing map (calculations were performed with the SOM toolbox for Matlab developed by Vesanto et al.37 using the default settings for a 3D SOM of size 20 × 10 × 5 with rectangular units); (3) t-distributed stochastic neighbor embedding (t-SNE29). The t-SNE mapping to three dimensions was done using the default settings of the toolbox for dimensionality reduction and methodology described by van der Maaten et al.29 (http://homepage.tudelft.nl/19j49/Home.html). All these data mapping methods are unsupervised, which means they are not biased by prior assumptions about the importance of individual m/z values or pixels, and were performed in MATLAB. The analyses were performed on the processed data, but can equally easily be done on the raw data set. The Supporting Information contains a flowchart of the methodology and example MATLAB code for some of these data modeling methods.
Data Visualization. Unlike the approach typically taken for hyperspectral imaging, we use the spectra rather than the images as observations, such that the axis system shows the (dis)similarity of different pixels. For PCA, the score value for each pixel on each of the three first principal components was translated to RGB color-coding of that pixel by varying the red, green and blue intensity linearly on the three independent axes,22 such that the minimum value on the axis is represented by a color intensity of 0, and the maximum value on the axis has intensity 1 (on a scale of 0 to 1), see Figure 2d. For SOM, the units in the map were colored by RGB for their position as illustrated in Figure 2g, again with the red, green and blue intensities varying linearly between 0.1 and 0.9 (on a scale of 0 to 1) for the minimum and maximum location on the map. Each pixel was mapped onto the neuron with the most similar weight vector (as measured by Euclidean distance), and was assigned the color of this best matching unit in the 3D SOM.38,39 The locations of pixels on the three t-SNE axes were used for RGB color-coding, similar to the situation for PCA, where the red, green and blue intensity were adjusted linearly between 0 and 1 (on a scale of 0 to 1) for the minimum and maximum value on that t-SNE axis, respectively, see Figure 2j. Example code showing the color-coding of the multivariate modeling results is provided in the Supporting Information.
The liver data set consisted of four consecutive sections: one was used for histology, and one (the modeling section) was used to create PCA, SOM, and t-SNE models from its MSI data. The remaining two sections also resulted in MSI data sets, which were used for validation. The validation sections were processed by selecting those peaks that were retained for the section used to build the multivariate models, and pixel selection with threshold −0.5 for each data set followed by normalization to the median peak intensity of each pixel. For each data set independently, the data were log-transformed and mean-centered as described in Fonville et al.15 The scores of pixels from the validation sections in the PCA model were calculated from the coefficients generated for the modeling section and RGB coded as before. The best matching unit in the SOM model for the modeling section was found for each pixel from validation sections, and used to color-code each pixel in the validation sections as before. For t-SNE, prediction was performed by finding for each pixel from validation sections the most similar pixel in the modeling section (defined as the pixel with the minimal sum of squared differences between the molecular profiles after data processing). Note that no information on pixel position is used. The color of this most similar pixel in the modeling section was used in the visualization of the pixels in the validation sections (alternatively, one could use the out-of-sample extension available for parametric t-SNE40).
■ RESULTS AND DISCUSSION
Visualization of MSI Data Modeling Results with RGB Color-Coding. Traditionally, MALDI-based MS imaging data (e.g., from a formalin-fixed rat brain tissue,36 Figure 2a) are represented by single m/z images, where a color-scale reflects the intensity of the m/z bin for each pixel: Figure 2b shows images for three m/z values. Simultaneous evaluation of a selection of m/z values is facilitated by overlaying images, where a different color represents each m/z value,10 Figure 2c. These approaches, although informative, only allow visualization of a small subset of the available data and fail to exploit the true power of MSI methodology.
We propose to evaluate MSI data with multivariate methods that map nonlinear relations among m/z values onto a model space with reduced dimensionality, to characterize the different tissue structures in a rich and concise manner. To enable intuitive interpretation of these mapped results, an additive red−green−blue (RGB) color scheme is developed for visualization, where the location of a pixel in the model space determines its color in the image. The preservation of the organization of the pixels in the new axis system of the model combined with this color-coding scheme enables an intuitive interpretation of the resulting images, where similar colors (i.e., similar RGB values) represent more closely related spectra and hence reflect similar biochemical profiles.
PCA creates composite linear axes (loadings) that are ordered according to axes of variance. The scores are the positions of pixels on a principal component (PC) and the score value reflects the molecular content as measured in MSI: pixels with similar molecular profiles will have similar scores. The score value of each pixel on the first PC (PC 1) determines the intensity of the red channel for that pixel in the image: pixels with similar scores on PC 1 will have similar redness (although the green and blue coloring for the pixels and thus the total pixel appearance may differ). Similarly, the score values on the second and third PC determine the green and blue intensity of each pixel. Thus, the created image displays the PCA mapping results through color, such that pixels with similar mass spectra have similar colors. This intelligent color-coding of an image based on multivariate modeling results enables biological and anatomical interpretation of the MSI data set with a single image (Figure 2d−f).22 The variation on PC 1, shown in red in Figure 2f, corresponds to the differences between gray and white matter in the brain. In addition, the variation on PC 2 and 3 differentiates, for example, the center of the hypothalamus (pink) from the cerebellum (orange), regions that were not differentiated from each other in the coloring shown in Figure 2c, as a result of the use of all available m/z images, rather than, for example, three.
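As an illustration of the PCA color-coding described above, the following is a minimal sketch rather than the authors' implementation. It assumes an already processed pixel-by-m/z table, computes scores with a plain SVD, and rescales each of the first three score axes linearly to the range 0 to 1. The names `pca_scores` and `scores_to_rgb` and the random input are hypothetical.

```python
import numpy as np

def pca_scores(table, n_components=3):
    """Scores of each pixel (row) on the first principal components."""
    centered = table - table.mean(axis=0)          # column mean-centering
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T          # rows of vt are loadings

def scores_to_rgb(scores, low=0.0, high=1.0):
    """Linearly rescale each axis so its minimum maps to low, maximum to high."""
    mn, mx = scores.min(axis=0), scores.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)         # guard against a flat axis
    return low + (scores - mn) / span * (high - low)

# Synthetic stand-in for a processed data set: 6 x 5 pixels, 8 m/z bins.
rng = np.random.default_rng(0)
x, y, n = 6, 5, 8
table = rng.random((x * y, n))
rgb_image = scores_to_rgb(pca_scores(table)).reshape(x, y, 3)
```

The resulting array can be passed to an image display routine that accepts RGB values between 0 and 1. Because the sign of each principal component is arbitrary, a channel may appear inverted between implementations; the relative similarity of colors is unaffected.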
Figure 3. Validation of the methodology through prediction of consecutive liver sections. A snap-frozen cirrhotic human liver sample from a patient with nonalcoholic steatohepatitis (NASH) was cryo-sectioned into 5 μm thick consecutive sections, three of which were used for MSI and the fourth (a) was stained using standard hematoxylin and eosin (H&E) stain; scale bar = 1 mm. A schematic of the tissue is illustrated in (b) where fibrotic septae composed of extracellular matrix material (1) separate regenerative nodules of hepatocytes (arrows). Multivariate modeling was performed on MSI data acquired for the first section (the top row above the red division line). This model was used to color-code the pixels of the MSI data for the second and third sections according to their molecular profiles (middle and bottom): (c) PCA model results, (d) SOM results, and (e) t-SNE results. The black pixels in the middle section correspond to data for which the median of the signal was zero (see Fonville et al.15) and were therefore excluded from analysis: these pixels correspond to the region of a tear in the tissue.
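For the t-SNE panels, the Methods describe coloring each validation pixel with the color of the most similar modeling-section pixel, with similarity measured as the minimal sum of squared differences between processed profiles. A minimal sketch of that nearest-neighbor color transfer follows; the function name `transfer_colors` and the toy arrays are hypothetical, and the brute-force distance computation is an assumption made for brevity.

```python
import numpy as np

def transfer_colors(model_pixels, model_rgb, new_pixels):
    """Give each new pixel the RGB color of its most similar model pixel."""
    # Sum of squared differences between every new and every model profile.
    diff = new_pixels[:, None, :] - model_pixels[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return model_rgb[nearest], nearest

# Toy example: three model pixels with two m/z bins each, one color per pixel.
model_pixels = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
model_rgb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
new_pixels = np.array([[0.9, 0.1], [0.1, 0.8]])
new_rgb, nearest = transfer_colors(model_pixels, model_rgb, new_pixels)
```

Pixel coordinates play no role here, in line with the statement that no positional information is used. For full-size data sets the distance matrix would likely need to be computed in chunks to limit memory use.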
The self-organizing map (SOM, initially developed by Kohonen et al.28) is a type of unsupervised artificial neural network that maps high-dimensional data to a lower dimensional space in a topology-preserving manner. A SOM consists of a set of units, ordered in an array, where each unit has an associated weight vector.28 These units are organized such that the molecular profiles in each weight vector are more similar for neighboring units than for distant units. An MSI pixel is mapped on the SOM by finding its best matching unit: the SOM unit whose weight vector is most similar to the measured mass spectral profile (based on Euclidean distance).41 Similar pixels will be mapped on the same unit, and neighboring units will map closely related pixels, a direct result of SOM’s self-organized character. Therefore, as with PCA, the positions of pixels in the model space, the SOM map, can be used as a measure of molecular similarity. For each unit in the SOM map, the level of red, green, and blue is linearly dependent on its position along the three dimensions of the map (Figure 2g); this color-coding process acknowledges the self-organized character of the map as neighboring units have similar colors. The best matching unit of a pixel determines the color for that pixel in the image (Figure 2g−i).38,39 Similar images are obtained for other SOM geometries (Supporting Information Figure S-1). From Figure 2i, it is clear that, although the cerebral cortex and cerebellar cortex are both gray matter, their molecular compositions differ: the mass spectral profiles for these regions are mapped to different regions of the SOM and, thus, are shown in different colors.
A similar procedure is applied to visualize the MSI data mapped with t-distributed stochastic neighbor embedding (t-SNE).29 This dimensionality reduction algorithm creates an axis system that preserves both the global and the local structure of the high-dimensional data, and is particularly well-suited to visualize complex data when the required model space is small, e.g. only three dimensions.29 When MSI data are modeled with t-SNE, the pixels form tight and dispersed sets of clusters, and these clusters correspond to defined anatomical features (Figure 2j). RGB color-coding of pixels, linearly based on their position in the t-SNE axis system, translates to a contrasting coloring for the different anatomic regions in the image, such as a crisp display of the septum in pink and the pituitary gland in peach (Figure 2j−l).
Our approach of visualizing MSI data through RGB colors defined by multivariate modeling represents dissimilarities of molecular profiles by different colors, as the color reflects the positioning of pixels in a lower-dimensional model space. This results in a continuous color-scheme, which avoids the need for discrete interpretation such as in segmentation analysis, and the associated difficulty of choosing the required number of classes into which to segment the data.25,26 We have demonstrated our visualization strategy using three conceptually different mapping methodologies. The recently developed t-SNE shows excellent visualization of the different tissue structures. This results from its ability to preserve local structure for nonlinear data with a small number of latent variables29 and is thus ideally suited for our three-dimensional model and RGB encoding. The proposed visualization framework is not limited to PCA, SOM, and t-SNE: results with kernelPCA, ISOMAP, and SAMMON methods are shown in Supporting Information Figure S-1.
PCA, SOM, and t-SNE Modeling Results for a Range of Biological MSI Data Sets. The anatomical and biological relevance of small molecule MALDI MSI data and the universal applicability and strength of MSI data visualization through color-coding of multivariate modeling results were demonstrated for various tissue types of different species (including model animal tissue and surgical human liver tissue), different preparation methods (formalin-fixed and fresh tissue) and for data from different MALDI MSI lasers and techniques. All data were visualized in a wholly unbiased manner, as no user interaction is needed for the modeling and color-coding.
Figure 4. Exemplar molecular profiles for the MSI data of the formalin-fixed control rat brain shown in Figure 2. (a) The scores of pixels on PC 1 in the PCA model (identical to Figure 2d, but viewed from a different angle) determine their redness. (b) Image based on the color-coding shown in a. (c) Loadings for PC 1 in the PCA model: peaks more bright red have larger positive weights on PC 1, more bright blue have a negative weight on PC 1; the height of the peak corresponds to the average signal intensity of the normalized data (before log-transformation and mean-centering). (d) The units of the 20 × 10 × 5 SOM are RGB color-coded, and a 3 × 3 × 1 region is highlighted. (e) Pixels for which the best matching unit is in the 3 × 3 × 1 area highlighted in d are shown in color, indicating part of the corpus callosum. (f) Median of the 9 weight vectors corresponding to the SOM units highlighted in d: negative weights are shown blue, positive weights in red, as projected on the average signal intensity as in c. (g) The t-SNE model space (identical to Figure 2j, but viewed from a different angle). (h) Pixels inside the box in g define an anatomical feature: the hippocampus. (i) Median spectral profile of the MSI data from the pixels highlighted in g, where peaks in red are relatively high in hippocampal pixels and blue peaks are relatively low.
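Panels d−f rely on two SOM operations: assigning each unit an RGB color from its position on the 20 × 10 × 5 map, and finding the best matching unit of each pixel by Euclidean distance. The sketch below illustrates both under stated assumptions. It does not train a SOM: the weight vectors are random stand-ins for a trained map, the units are assumed to be stored in row-major order over the grid, and the names `som_unit_colors` and `best_matching_units` are hypothetical.

```python
import numpy as np

def som_unit_colors(grid_shape, low=0.1, high=0.9):
    """RGB per unit: each channel varies linearly along one map dimension."""
    axes = [np.linspace(low, high, size) for size in grid_shape]
    r, g, b = np.meshgrid(*axes, indexing="ij")
    return np.stack([r, g, b], axis=-1).reshape(-1, 3)

def best_matching_units(pixels, weights):
    """Index of the unit whose weight vector is closest to each pixel."""
    diff = pixels[:, None, :] - weights[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

grid_shape = (20, 10, 5)                      # map size used in the paper
colors = som_unit_colors(grid_shape)          # one color per unit, shape (1000, 3)

# Random weights stand in for a trained SOM (assumption, for illustration only).
rng = np.random.default_rng(1)
weights = rng.random((colors.shape[0], 8))    # 8 hypothetical m/z bins
pixels = weights[[0, 999, 123]]               # pixels identical to three units
bmu = best_matching_units(pixels, weights)
pixel_rgb = colors[bmu]                       # color of each pixel in the image
```

The 0.1 to 0.9 range follows the Methods text. Neighboring units differ by one step along a single channel, so pixels mapped to adjacent units receive similar colors.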
Supporting Information Figure S-2 shows results of the different multivariate modeling methodologies as applied to: a rat brain data set acquired with fast MSI methodology, a mouse brain data set, and a whole body section of a rat. For these analyses, nonlinear methods and in particular t-SNE showed strong differentiation of well-known anatomical features of the brain and internal organs. Thus, the degree of molecular similarity of different pixels as measured with MSI is shown in a single image, providing an overview of the differences in underlying biochemistry and tissue structure of the different tissue regions. When comparing results of the different multivariate models, it is clear that PCA, as expected, strongly represents overall variation in the MSI data: in Figure 2, PC 1 governs the redness of the pixels, and is differentiating between gray and white matter. In contrast, SOM and t-SNE are additionally capable of picking up more subtle anatomical differences, and in Figure 2 t-SNE represents the anatomy of the rat brain particularly well: the anatomical substructures of the brain are clearly distinguishable (Figure 2a,l). These results suggest that nonlinear modeling methods are highly appropriate for MSI data analysis, and emphasize the use of t-SNE, a method that has, to date, not been applied in the analysis of MSI data.
Validation of the Methodology through Analysis of Consecutive Cirrhotic Human Liver Sections. To validate our methodology and ascertain that the resulting images were artifact free, four consecutive serial sections of a cirrhotic human liver were analyzed, three of which were subjected to MALDI MSI and the fourth section was used for hematoxylin and eosin (H&E) staining (Figure 3a). Figure 3b shows a schematic of the tissue structure, highlighting areas of fibrotic matrix and regenerative nodules of hepatocytes. The classic lobular architecture of the sample is disrupted due to the fibrotic nature of the liver section and thus expanded portal triads are evident within the fibrotic septae, which separate regenerative nodules of hepatocytes. In Figure 3c−e, the top section was used to build the PCA, SOM, and t-SNE models. The data from the second and third sections, unused in the model making, were fitted with the multivariate models developed for the first section. The PCA scores and SOM best matching units were calculated to determine the pixel colors for the subsequent two sections. For t-SNE, each pixel in the second and third sections was assigned the color of the spectrally most similar pixel in the first section. When using the PCA, SOM, and t-SNE models to color-code unseen data from two neighboring liver sections, the biological patterns observed in the first section and H&E stained section are reproduced, validating the robustness of the modeling and color-coding approach.
The biological relevance of this display of MSI data by multivariate modeling and color-coding is particularly well exemplified by the results shown in Figure 3. The results show clear separation of pixels from acellular regions of fibrotic matrix deposition (shaded area in Figure 3b) and areas consisting of cellular material such as the hepatocyte lobules (white areas). Most importantly, we can visualize the different cell populations based on their MSI profiles, and it is even possible to discriminate subpopulations of the same cell type within a single individual based on the differential MSI profiles: we see altered profiles associated with zonation within regenerative hepatocyte nodules (arrows in Figure 3b). For example, there is a clear variation in nodule colors in Figure 3d and e from the periphery to the center
of the feature labeled “2” in Figure 3b. The t-SNE results (Figure 3e) also illustrate the intrahepatic heterogeneity of regenerative hepatocyte nodules in different locations within the liver parenchyma, with particularly bright signatures arising in the nodules to the right of the figure panels. Such features are hard to assess using standard immunohistochemical techniques, and while such heterogeneity can reflect nonuniformity of vascular or biliary supply, the potential for early detection of local dysplastic change merits further investigation. We can also clearly detect differential mass spectral profiles within nonparenchymal cell compartments, for example the areas associated with large vessels within the tissue (e.g., large feature indicated by arrows in Figure 3a and c−e).
Multivariate modeling of MSI data confers fundamental advantages of digital staining20 compared to traditional histology images in terms of biological understanding. Traditional histology allows pathological investigations at a cellular level by studying anatomical abnormalities. This high-resolution evaluation of tissue is complemented by the detailed molecular nature of MSI, which reveals lateral distribution of chemical information within the sample. Although immunohistochemistry allows the study of specific biomarkers, it requires prior knowledge of these markers to develop antibodies against them, and can only investigate a single biomarker per section. In contrast, a single MSI experiment can extract a wealth of chemical information from a single section.
Visualizing the Molecular Profiles. An important feature of multivariate modeling is the ability to interpret the molecular profiles: the models use weighted contributions of all mass signals as signatures, which are indicative of different anatomical or pathological features that culminated in the image visualizations.
One can use the loadings of PC 1 to find the molecular patterns differentiating white and gray matter in the PCA model (Figure 4a−c). The m/z bins for which high loadings are found in PC 1, reflecting the molecular differences between the white and gray matter tissues as measured by MSI, indicates for example a relatively higher prevalence (intense blue loadings color in Figure 4c) of m/z 835.6 bin in the gray matter, putatively identified as representing a sphingomyelin: SM(d18:1/24:1) + Na.11,36,42 For SOMs, the spectral weights of the units can be shown and combined to characterize the pixels for which these are the best matching units: Figure 4d−f shows a composite weight vector for pixels in the corpus callosum, which indicates, for example, higher levels of m/z 788.4 in this region, and may correspond to a phosphatidylcholine: PC(36:1) + H.11,36,42−45 The molecular profiles of the clear clusters of pixels established by t-SNE can be displayed by, for example, the mean or median of their spectra (Figure 4g−i). From the molecular profile shown in Figure 4i, it can be seen that the hippocampus depicts high levels of m/z 804.4, provisionally identified as PC(36:4) + Na.11,36,42,44,45 The mapping of pixels in the model space can thus be used not only to analyze tissue microanatomy based on molecular composition but also to interpret the molecular masses for biochemical characterization. Information on the mass spectral profiles for tissue regions and their differences is, as in any -omics investigation, especially useful for the definition of biomarkers and the improved understanding of biochemistry of healthy tissue, in pathogenesis and in multiple disease states. Naturally, these results have to be considered within the limits of binned data, and the chemistry intrinsic to the MALDI MSI technique, as well as, for example, ion suppression as a result of salt presence and differences in ionization efficiency. 
Interestingly, one could actually exploit the methodology to better understand these
interesting and complicated mechanisms, for example, by studying two otherwise identical tissue samples and subjecting them to different treatment or sample preparation protocols, to quantify and interpret the resulting differences in observed mass spectral profiles. The interpretation of the untargeted molecular profiles that our methodology offers, provides a basis for further detailed data analysis: characteristic peaks can be subjected to MSn and other structural and biochemical methodologies as a route toward biomarker identification and validation for mechanistic investigations into specific pathologies.
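As a rough illustration of how molecular profiles can be read from a model, the sketch below ranks m/z bins by the magnitude of their PC 1 loadings and computes the median spectrum of a group of pixels. It is not the code from the Supporting Information: the function names `top_loading_bins` and `cluster_profile`, the synthetic data, and the m/z axis are hypothetical and were chosen only for illustration. The sign of a principal component is arbitrary, so only the relative signs of the loadings carry meaning.

```python
import numpy as np

def top_loading_bins(spectra, mz, component=0, n_top=5):
    """Return (m/z, loading) pairs with the largest absolute loadings on one PC."""
    X = np.asarray(spectra, dtype=float)
    X = X - X.mean(axis=0)                      # mean-center each m/z bin
    # SVD-based PCA: the rows of Vt are the loading vectors
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    loadings = Vt[component]
    order = np.argsort(np.abs(loadings))[::-1][:n_top]
    return [(float(mz[i]), float(loadings[i])) for i in order]

def cluster_profile(spectra, pixel_mask):
    """Median spectrum of the pixels selected by a boolean mask."""
    return np.median(np.asarray(spectra, dtype=float)[pixel_mask], axis=0)

# Synthetic data (hypothetical): bin 10 is elevated in the second pixel group
rng = np.random.default_rng(1)
n_pixels, n_mz = 40, 30
mz = np.linspace(700.0, 900.0, n_mz)            # hypothetical m/z bin centers
spectra = 0.05 * rng.random((n_pixels, n_mz))
group_b = np.arange(n_pixels) >= n_pixels // 2
spectra[group_b, 10] += 1.0

top = top_loading_bins(spectra, mz, n_top=3)    # bins driving PC 1
profile_b = cluster_profile(spectra, group_b)   # median spectrum of group B
```

In this toy example the bin that separates the two pixel groups dominates PC 1 and also stands out in the median spectrum of the second group, which mirrors how loadings and cluster spectra are compared in Figure 4.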
■ CONCLUSIONS
The acquisition of spectra using MALDI MSI is analogous to -omics research, where instead of purifying, identifying and quantifying each individual constituent, a characteristic fingerprint of the sample composition is generated with a powerful analytical method. As a result of combining the imaging and the -omics approach, the information density of MSI data is enormous: the data set reports the chemical composition of each pixel with mass spectrometry-based molecular fingerprints. Preferably, the analysis of an MSI data set would address both the spatial and spectroscopic aspects of the data. However, the predominant approach to data analysis in the MSI literature takes only one property (either spatial or molecular) into account at a time, and disregards a large part of the carefully acquired data: these, often visual, analyses are biased toward existing biological knowledge, prior assumptions, and peaks with high intensity. The large number of m/z variables makes hyperspectral imaging tools ultimately suited for the modeling, analysis, and visualization of MSI data, by providing a more thorough and reliable overview of the processed MSI data without discarding data or using prior information. Using dimensionality reduction and model visualization with RGB color-coding, we create intuitive displays and enable straightforward MSI data interpretation: by displaying both the spectral data (color-coded) and the location (pixel position in the image), the link between data modeling and biological interpretation is retained. This near-complete overview of the data in one single image cannot be achieved even by overlaying a targeted selection of m/z images. In fact, linear methods such as PCA might often require more than three components to accurately describe the details of a complex data set. It is, of course, possible to visualize, for example, principal components 4, 5, and 6 with RGB color-coding too.
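The RGB color-coding of a three-component model can be sketched as follows. This is a minimal illustration under stated assumptions, not the code provided in the Supporting Information: the function name `pca_rgb_image`, the synthetic spectra, the row-major pixel ordering, and the per-component min-max scaling to [0, 1] are all choices made here for illustration. The hypothetical argument `first_component` selects which three consecutive principal components are displayed, so `first_component=3` would show components 4, 5, and 6.

```python
import numpy as np

def pca_rgb_image(spectra, rows, cols, first_component=0):
    """Map three consecutive PCA score vectors to the R, G and B channels.

    spectra: array of shape (rows * cols, n_mz), one processed spectrum per
             pixel in row-major order (an assumption of this sketch).
    Returns an array of shape (rows, cols, 3) with values in [0, 1].
    """
    X = np.asarray(spectra, dtype=float)
    X = X - X.mean(axis=0)                      # mean-center each m/z bin
    # SVD-based PCA: scores are U * S, loadings are the rows of Vt
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    k = slice(first_component, first_component + 3)
    scores = U[:, k] * S[k]                     # shape (n_pixels, 3)
    # Min-max scale each component independently to [0, 1]
    lo = scores.min(axis=0)
    span = scores.max(axis=0) - lo
    span[span == 0] = 1.0                       # guard against constant components
    rgb = (scores - lo) / span
    return rgb.reshape(rows, cols, 3)

# Synthetic example (hypothetical): two "tissue types" with different profiles
rng = np.random.default_rng(0)
rows, cols, n_mz = 4, 6, 50
profile_a = rng.random(n_mz)
profile_b = rng.random(n_mz)
labels = np.zeros((rows, cols), dtype=int)
labels[:, cols // 2:] = 1                       # right half of the image is type B
base = np.where(labels.reshape(-1, 1) == 0, profile_a, profile_b)
spectra = base + 0.01 * rng.standard_normal((rows * cols, n_mz))

image = pca_rgb_image(spectra, rows, cols)      # (4, 6, 3) array of RGB values
```

The returned array can be passed to any image display routine that accepts floating-point RGB values in [0, 1]. Because the modeling uses only the spectra, pixels with similar molecular profiles receive similar colors regardless of their position.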
However, nonlinear methods such as SOM and t-SNE can summarize such complicated data in fewer dimensions, and showed the clearest overview for complex data sets. t-SNE was specifically designed for and excels at summarizing hyperspectral data by accurately preserving and representing spectral characteristics for nonlinear mapping onto two or three dimensions.40 Thus, it is no surprise that the clearest color-coded images were from this powerful manifold learning method, and algorithmic improvements to increase the computational efficiency (t-SNE calculations took around 10 h for the presented data sets, depending on size) are ongoing.46 All analyses shown here were unsupervised: no information about the position of pixels is included in the modeling, and mapping is based purely on the mass spectra. The only user input after MSI data acquisition is the definition of a main matrix peak and a threshold for pixel selection,15 and the multivariate models require no additional information. Using these unsupervised methods, color-coded images were obtained that showed remarkable clustering in relation to anatomical features. The display of modeling results as an image has an inherent validation, as neighboring pixels in anatomical (sub)structures are expected to show similar colors. Advancements in MSI technology are increasing the dimensionality and size of the data, with a concomitant imperative to apply appropriate and efficient computational techniques.19 Our methodology tries to address the stunning observation that the majority of researchers evaluate only a small fraction of the information present in the processed data. In fact, virtually all large spectral data sets can benefit from multivariate modeling, including mass spectrometry imaging data from different classes of molecules or acquired with different analytical approaches. We showed the results of a formalin-fixed tissue sample, highlighting the potential of this methodology to investigate legacy biosamples, which provides opportunities for retrospective studies of precious samples and large data banks.47 There is currently much interest in applying advanced phenotyping techniques for enhanced clinical performance.48 MALDI MSI and related tools offer one level of analytical entry to patient monitoring, especially in surgical situations where tissues are already being sampled for routine diagnostic histopathology, for example, in the evaluation of tumor samples.24,49,50 Our studies show the potential utility of multivariate molecular image modeling for enhanced image and spectroscopic information recovery, as well as a new approach to understanding tissue structure-biochemistry correlations. The developed methodology could ultimately aid the longer-term deployment of MSI in the clinical laboratory and hospital surgical environment to enhance diagnostic and therapeutic decision-making. Overall, the challenge is to clearly demonstrate the advantages, reliability, and reproducibility of MSI and exploit these properties in a clinically translational manner.
The multivariate analysis of MSI data extracts a wealth of information from these vast spectral data sets without tailoring of the methodology: the RGB-encoded images show a sharp anatomical relevance and are an exceptional complement to the traditional histology image. Results on a wide range of tissue types, donors and different sample preparation methods show the robustness and ease of application of our approach, and demonstrate that rather than analysis of single m/z values, a multivariate approach should be preferred, summarizing an unrivalled volume of data in a single image. Therefore, these intuitive displays of biologically relevant structures and corresponding molecular profiles are expected to further MSI research, including applications ranging from biology to drug development and clinical healthcare.
■
Present Address
# Centre for Pathogen Evolution, Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, United Kingdom.
Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.
Notes
The authors declare no competing financial interest.
■ ACKNOWLEDGMENTS
We acknowledge Laurens van der Maaten (Delft University of Technology) and Martin Spitaler (Imperial College London) for helpful discussions. This work was supported by the award of an RSC PhD studentship to J.M.F. and an EPSRC/RSC studentship to C.L.C. We gratefully acknowledge funding for this work provided by EPSRC Grant EP/F50053X/1 for studentships to A.D.P. and R.T.S. through the PSIBS Doctoral Training Centre at the University of Birmingham. We gratefully acknowledge the support of the Imperial NIHR BRC for funding for our surgical MALDI imaging program.
■ ASSOCIATED CONTENT
Supporting Information
Resulting images of RGB encoding of other hyperspectral embedding methods and multivariate modeling of different tissue MSI data sets; and example processed MSI data and runnable code to calculate multivariate models and perform RGB color-coding. This material is available free of charge via the Internet at http://pubs.acs.org.
■ AUTHOR INFORMATION
Corresponding Author
*Tel: +44(0)121418810 (J.B.); +44(0)2075943220 (E.H.). Fax: +44(0)121414403 (J.B.); +44(0)2075943226 (E.H.). E-mail: j.[email protected] (J.B.); [email protected] (E.H.).
■ REFERENCES
(1) Chughtai, K.; Heeren, R. M. A. Chem. Rev. 2010, 110, 3237−3277.
(2) McDonnell, L. A.; Heeren, R. M. A. Mass Spectrom. Rev. 2007, 26, 606−643.
(3) Stoeckli, M.; Chaurand, P.; Hallahan, D. E.; Caprioli, R. M. Nat. Med. 2001, 7, 493−496.
(4) Wiseman, J. M.; Ifa, D. R.; Song, Q. Y.; Cooks, R. G. Angew. Chem., Int. Ed. 2006, 45, 7188−7192.
(5) Pacholski, M. L.; Cannon, D. M.; Ewing, A. G.; Winograd, N. Rapid Commun. Mass Spectrom. 1998, 12, 1232−1235.
(6) Steinhauser, M. L.; Bailey, A. P.; Senyo, S. E.; Guillermier, C.; Perlstein, T. S.; Gould, A. P.; Lee, R. T.; Lechene, C. P. Nature 2012, 481, 516−519.
(7) Rauser, S.; Marquardt, C.; Balluff, B.; Deininger, S. O.; Albers, C.; Belau, E.; Hartmer, R.; Suckau, D.; Specht, K.; Ebert, M. P.; Schmitt, M.; Aubele, M.; Hofler, H.; Walch, A. J. Proteome Res. 2010, 9, 1854−1863.
(8) Seeley, E. H.; Caprioli, R. M. Proc. Natl. Acad. Sci. U.S.A. 2008, 105, 18126−18131.
(9) Yang, Y. L.; Xu, Y. Q.; Straight, P.; Dorrestein, P. C. Nat. Chem. Biol. 2009, 5, 885−887.
(10) Khatib-Shahidi, S.; Andersson, M.; Herman, J. L.; Gillespie, T. A.; Caprioli, R. M. Anal. Chem. 2006, 78, 6448−6456.
(11) Berry, K. A. Z.; Hankin, J. A.; Barkley, R. M.; Spraggins, J. M.; Caprioli, R. M.; Murphy, R. C. Chem. Rev. 2011, 111, 6491−6512.
(12) Goodwin, R. J. A. J. Proteomics 2012, 75, 4893−4911.
(13) Wang, H. Y. J.; Liu, C. B.; Wu, H. W. J. Lipid Res. 2011, 52, 840−849.
(14) Wang, H. Y. J.; Wu, H. W.; Tsai, P. J.; Liu, C. B. Anal. Bioanal. Chem. 2012, 404, 113−124.
(15) Fonville, J. M.; Carter, C.; Cloarec, O.; Nicholson, J. K.; Lindon, J. C.; Bunch, J.; Holmes, E. Anal. Chem. 2012, 84, 1310−1319.
(16) Norris, J. L.; Cornett, D. S.; Mobley, J. A.; Andersson, M.; Seeley, E. H.; Chaurand, P.; Caprioli, R. M. Int. J. Mass Spectrom. 2007, 260, 212−221.
(17) Deininger, S. O.; Cornett, D. S.; Paape, R.; Becker, M.; Pineau, C.; Rauser, S.; Walch, A.; Wolski, E. Anal. Bioanal. Chem. 2011, 401, 167−181.
(18) Dai, Y. Q.; Whittal, R. M.; Li, L. Anal. Chem. 1996, 68, 2494−2500.
(19) Watrous, J. D.; Alexandrov, T.; Dorrestein, P. C. J. Mass Spectrom. 2011, 46, 209−222.
(20) Hanselmann, M.; Kothe, U.; Kirchner, M.; Renard, B. Y.; Amstalden, E. R.; Glunde, K.; Heeren, R. M. A.; Hamprecht, F. A. J. Proteome Res. 2009, 8, 3558−3567.
(21) McCombie, G.; Staab, D.; Stoeckli, M.; Knochenmuss, R. Anal. Chem. 2005, 77, 6118−6124.
(22) Dill, A. L.; Ifa, D. R.; Manicke, N. E.; Costa, A. B.; Ramos-Vara, J. A.; Knapp, D. W.; Cooks, R. G. Anal. Chem. 2009, 81, 8758−8764.
(23) Dill, A. L.; Eberlin, L. S.; Zheng, C.; Costa, A. B.; Ifa, D. R.; Cheng, L. A.; Masterson, T. A.; Koch, M. O.; Vitek, O.; Cooks, R. G. Anal. Bioanal. Chem. 2010, 398, 2969−2978.
(24) Balluff, B.; Elsner, M.; Kowarsch, A.; Rauser, S.; Meding, S.; Schuhmacher, C.; Feith, M.; Herrmann, K.; Rocken, C.; Schmid, R. M.; Hofler, H.; Walch, A.; Ebert, M. P. J. Proteome Res. 2010, 9, 6317−6322.
(25) Deininger, S. O.; Ebert, M. P.; Futterer, A.; Gerhard, M.; Rocken, C. J. Proteome Res. 2008, 7, 5230−5236.
(26) Alexandrov, T.; Becker, M.; Deininger, S. O.; Ernst, G.; Wehder, L.; Grasmair, M.; von Eggeling, F.; Thiele, H.; Maass, P. J. Proteome Res. 2010, 9, 6535−6546.
(27) Bruand, J.; Sistla, S.; Meriaux, C.; Dorrestein, P. C.; Gaasterland, T.; Ghassemian, M.; Wisztorski, M.; Fournier, I.; Salzet, M.; Macagno, E.; Bafna, V. J. Proteome Res. 2011, 10, 1915−1928.
(28) Kohonen, T. Proc. IEEE 1990, 78, 1464−1480.
(29) Van der Maaten, L.; Hinton, G. J. Mach. Learn. Res. 2008, 9, 2579−2605.
(30) Izenman, A. J. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning; Springer Science: New York, 2008.
(31) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, 2009.
(32) Ma, Y., Fu, Y., Eds. Manifold Learning Theory and Applications; CRC Press: Boca Raton, FL, 2012.
(33) Simmons, D. A. Improved MALDI-MS Imaging Performance Using Continuous Laser Rastering; Technical Note; ABS Applied Biosystems: Bedford, MA, 2008.
(34) Trim, P. J.; Djidja, M. C.; Atkinson, S. J.; Oakes, K.; Cole, L. M.; Anderson, D. M. G.; Hart, P. J.; Francese, S.; Clench, M. R. Anal. Bioanal. Chem. 2010, 397, 3409−3419.
(35) Jurchen, J. C.; Rubakhin, S. S.; Sweedler, J. V. J. Am. Soc. Mass. Spectrom. 2005, 16, 1654−1659.
(36) Carter, C. L.; McLeod, C. W.; Bunch, J. J. Am. Soc. Mass. Spectrom. 2011, 22, 1991−1998.
(37) Vesanto, J.; Himberg, J.; Alhoniemi, E.; Parhankangas, J. Report A57: SOM toolbox for Matlab 5, 2000, http://www.cis.hut.fi/projects/somtoolbox.
(38) Vesanto, J. Intell. Data Anal. 1999, 3, 111−126.
(39) Gross, M. H.; Seibert, F. Visual Comput. 1993, 10, 145−159.
(40) Van der Maaten, L. J. P. J. Mach. Learn. Res. 2009, 5, 384−391.
(41) Wolkenstein, M.; Hutter, H.; Mittermayr, C.; Schiesser, W.; Grasserbauer, M. Anal. Chem. 1997, 69, 777−782.
(42) Shrivas, K.; Hayasaka, T.; Goto-Inoue, N.; Sugiura, Y.; Zaima, N.; Setou, M. Anal. Chem. 2010, 82, 8800−8806.
(43) Wang, H. Y. J.; Post, S. N. J. J.; Woods, A. S. Int. J. Mass Spectrom. 2008, 278, 143−149.
(44) Jackson, S. N.; Ugarov, M.; Post, J. D.; Egan, T.; Langlais, D.; Schultz, J. A.; Woods, A. S. J. Am. Soc. Mass. Spectrom. 2008, 19, 1655−1662.
(45) Sugiura, Y.; Konishi, Y.; Zaima, N.; Kajihara, S.; Nakanishi, H.; Taguchi, R.; Setou, M. J. Lipid Res. 2009, 50, 1776−1788.
(46) Van der Maaten, L. J. P. Neural Inf. Process. Syst. (NIPS) 2010 Workshop Challenges Data Visualization: Fast Optimization for t-SNE, 2010.
(47) Seeley, E. H.; Caprioli, R. M. Trends Biotechnol. 2011, 29, 136−143.
(48) Kinross, J. M.; Holmes, E.; Darzi, A. W.; Nicholson, J. K. Lancet 2011, 377, 1817−1819.
(49) Kang, S.; Shim, H. S.; Lee, J. S.; Kim, D. S.; Kim, H. Y.; Hong, S. H.; Kim, P. S.; Yoon, J. H.; Cho, N. H. J. Proteome Res. 2010, 9, 1157−1164.
(50) Agar, N. Y. R.; Malcolm, J. G.; Mohan, V.; Yang, H. W.; Johnson, M. D.; Tannenbaum, A.; Agar, J. N.; Blacks, P. M. Anal. Chem. 2010, 82, 2621−2625.