REVIEWS pubs.acs.org/jpr

Biospectroscopy to metabolically profile biomolecular structure: a multistage approach linking computational analysis with biomarkers

Jemma G. Kelly,† Julio Trevisan,†,‡ Andrew D. Scott,§ Paul L. Carmichael,§ Hubert M. Pollock,† Pierre L. Martin-Hirsch,|| and Francis L. Martin*,†

†Centre for Biophotonics, Lancaster Environment Centre, Lancaster University, Lancaster, United Kingdom
‡School of Computing and Communications, Infolab21, Lancaster University, Lancaster, United Kingdom
§Safety and Environmental Assurance Centre, Unilever Colworth Science Park, Bedfordshire, United Kingdom
||Lancashire Teaching Hospitals NHS Trust, Preston, United Kingdom




ABSTRACT: Biospectroscopy is employed to derive absorbance spectra representative of biomolecules present in biological samples. The mid-infrared region (λ = 2.5-25 μm) is absorbed to give a biochemical-cell fingerprint (ν̃ = 1800-900 cm⁻¹). Cellular material produces complex spectra due to the variety of chemical bonds present. The complexity and size of spectral data sets warrant multivariate analysis for data reduction, interpretation, and classification. Various multivariate analyses are available including principal component analysis (PCA), partial least-squares (PLS), linear discriminant analysis (LDA), and evolving fuzzy rule-based classifier (eClass). Interpretation of both visual and numerical results facilitates biomarker identification, cell-type discrimination, and predictive and mechanistic understanding of cellular behavior. Biospectroscopy is a high-throughput nondestructive technology. A comparison of biomarkers/mechanistic knowledge determined from conventional approaches to biospectroscopy coupled with multivariate analysis often provides complementary answers and a novel approach for diagnosis of disease and cell biology.

KEYWORDS: biomarkers, biospectroscopy, infrared spectroscopy, multivariate analysis, principal component analysis, screening, diagnosis, machine learning

Received: October 24, 2010. Published: January 06, 2011. © 2011 American Chemical Society. dx.doi.org/10.1021/pr101067u | J. Proteome Res. 2011, 10, 1437–1448

1. INTRODUCTION

Spectroscopy is increasingly recognized as a powerful technology in biomedical research.1 Pioneered by physicists and chemists, it has traditionally been used for the identification of molecular entities in unknown substances. It is now used in many research laboratories to measure the relative quantities of biochemical components present in cells and tissues; this has given rise to the field of “biospectroscopy”. As technologies improve, we are increasingly able to acquire spectra from the smallest cellular components, organelles. The mid-infrared (IR) frequency range of the electromagnetic spectrum (λ = 2.5-25 μm or, in wavenumbers, ν̃ = 1/λ = 4000-400 cm⁻¹) is absorbed by biomolecules.2,3 Within the mid-IR range, 1800-900 cm⁻¹ is regarded as the biochemical-cell fingerprint region, because it contains the fundamental vibrational modes of the structures present in biological specimens.

Biospectroscopy encompasses a number of approaches: Fourier-transform IR (FTIR), attenuated total reflection FTIR (ATR-FTIR), Raman and photothermal microspectroscopy (PTMS). FTIR microspectroscopy can be employed where chemical bonds are present and known to exhibit a certain degree of movement in the form of bending, stretching and rotating, allowing the absorption of IR electromagnetic energy through a changing dipole moment. Raman spectroscopy may also be employed to measure the presence of biochemical bonds, based on their vibrations, by detecting the Raman scattering. Raman scattering is also dependent on the vibrations of chemical bonds but is not restricted to those with dipole moments, and is unaffected by aqueous conditions. Raman and FTIR spectroscopy are therefore considered to be complementary techniques. Applications of ATR-FTIR spectroscopy include distinguishing cells at different stages of the cell cycle, as well as different cell types and grades of cervical cancer, by identifying important underlying chemical differences represented in the IR spectra.2 Detection of cancerous changes within tissue types has enhanced the understanding of diseases such as cervical cancer, prostate cancer,5,6 skin cancer and colorectal cancer.1

The application of biospectroscopy generates spectra containing hundreds of variables (the absorbance intensities at each wavenumber); this can result in the production of large data sets representing a large number of biological samples, often with several spectra being acquired for each sample. Spectra from one sample may exhibit heterogeneity and, in our experience, most data set variability is found within data classes (i.e., cancer type, treatment regimen, tissue type, etc.) rather than between classes. Multivariate analysis is required to handle the data appropriately; in comparison, univariate analysis techniques may detect differences between classes, but they do not account for statistical dependencies between wavenumbers and thus are outperformed by multivariate analyses.7 The mathematical models employed typically have two purposes: exploratory analysis and/or classification of samples. The former helps the understanding of biochemical effects related to each class in the data set. The latter envisages biospectroscopy as a useful tool for real-world screening or diagnosis in different situations. Some interpretable discriminatory techniques suit both purposes.

Prior to the analysis of interest, data sets most often undergo preprocessing steps that allow spectra to be compared. In addition, depending on the analysis technique, the data set may be passed to a feature extraction algorithm, whose purpose can be to reduce the number of variables in order to prevent overfitting, to produce variables that incorporate domain knowledge (e.g., peak detection), or both. The choice of method for each of these steps is highly dependent on the biological samples, instrumentation technique, and purpose of analysis.

Conventional Biological Analysis Techniques

Methodologies more commonly used in research include, but are not limited to, metabolomics, proteomics and gene expression analysis. These techniques may also produce large, complex data sets, for which applying the correct analysis is particularly important in order to avoid errors introduced by technical bias.8,9 The Syrian hamster embryo assay depends on subjective scoring by a pathologist and simple statistical significance tests to predict chemical carcinogenicity.10 There is evidence that IR spectroscopy can be employed as an objective tool to perform such prediction.11 Additionally, the majority of current histopathology techniques use complex procedures and staining to enable visualization of section architecture; this can be time-consuming and subjective, and it renders samples nontransferable to other analysis techniques.

Advantages of Biospectroscopy

The major advantage of spectroscopy is that it can be applied to any biological sample, for example, archived pathology samples, without requiring staining or isotope-labeling; the method is nondestructive and samples may be reused.7 The clinical implications of spectroscopy are vast, particularly given recent advancements enabling spectra to be acquired in vivo, potentially in the operating theater.12 Biospectroscopy is potentially a powerful high-throughput technique complementary to conventional laboratory approaches used in biomedical research. Spectroscopy provides a rapid means of identifying small changes; results can be produced quickly, and it uses comparatively small amounts of reagents and consumables.13 The spatial resolution of Raman spectroscopy may reach 1 μm × 1 μm, allowing interrogation of individual cells or even nuclei.

Aims

The potential of spectroscopy as a clinical tool hinges on identifying and implementing a successful multistage approach that could be used as a novel diagnostic/screening tool facilitating earlier detection of disease. This could result in patients receiving earlier treatment and, in the long term, reduce mortality. Furthermore, biomarker extraction following exploratory analysis of larger data sets facilitates the discovery of potential mechanisms in the development of atypical cells; this might give rise to novel drug targets. IR spectroscopy is a diverse technology platform with functions ranging across biological and environmental applications, to be used as a complementary technology alongside existing molecular techniques.

The aim of this review is to explore the importance of appropriately linking biospectroscopy techniques with multivariate analysis. Biospectroscopy instrumentation results in the generation of large amounts of data where appropriate processing is key to facilitate a robust and reliable extraction of information, biomarkers, and classification models (Figure 1). Following an explanation of the fundamentals of biospectroscopy and some of its most common instrumentation techniques, computational methods specifically focusing on feature extraction or selection, exploratory analysis, and classification will be examined. The applications of biospectroscopy techniques coupled with multivariate analysis will be reviewed.

Figure 1. Overview of potential analyses following spectral acquisition. Flow diagram of the computational processing available following spectral acquisition; encompassing preprocessing, feature extraction and multivariate analysis.

2. BIOSPECTROSCOPY TECHNIQUES

IR Spectroscopy

Biomolecules absorb IR due to the vibrational movements of their chemical bonds, for example, bending, stretching, rocking, wagging or scissoring. As the movements occur at specific energy levels, following IR exposure the chemical bonds absorb IR at wavelengths corresponding to those energy levels. IR spectroscopy is the measurement of this absorption, producing spectra with peaks representing the chemical bonds present in the given sample. The IR region of the electromagnetic spectrum is subdivided into near-IR, mid-IR and far-IR regions. The mid-IR region (ν̃ = 4000-400 cm⁻¹) is particularly important for biological applications and contains the biochemical-cell fingerprint region (ν̃ = 1800-900 cm⁻¹). This region is known to be absorbed by a number of biochemicals and in IR spectroscopy generates identifiable peak absorption frequencies, including: Amide I (∼1650 cm⁻¹), Amide II (∼1550 cm⁻¹), Amide III (∼1260 cm⁻¹), carbohydrates (∼1155 cm⁻¹), glycogen (∼1030 cm⁻¹), lipids (∼1750 cm⁻¹), asymmetric phosphate stretching vibrations (νasPO2⁻; ∼1225 cm⁻¹), symmetric phosphate stretching vibrations (νsPO2⁻; ∼1080 cm⁻¹) and protein phosphorylation (∼970 cm⁻¹) (Figure 2A).2,3

The secondary structure of proteins is associated with nine Amide bands: A, B and I-VII, ordered by decreasing frequency. Amide I is centered on the carbonyl group (C=O) and is particularly sensitive to changes in β-sheet formation.13 Figure 2A shows a vibrational spectrum representative of the IR absorbed by cervical cells, Fourier-transformed and converted into an absorbance spectrum. By applying this methodology to biological samples (for example, acquiring spectra from tissues) and appropriately analyzing the data, differences between spectra may be isolated and related back to the location on the sample from which they were acquired. This may provide information about the underlying biochemistry from different areas of tissue or facilitate comparison of tissue from different individuals. Overall, FTIR spectroscopy coupled with multivariate analysis has significant potential in cell identification and disease characterization.

Figure 2. IR and Raman spectra. Example of a biochemical-cell fingerprint of (A) IR and (B) Raman spectra; tentative peak assignments of biochemicals are labeled.

FTIR Microspectroscopy and Instrumentation

Common IR light sources are globar (blackbody) or synchrotron-based radiation. Globar is a silicon carbide thermal mid-IR source, emitting radiation from λ = 2.5-25 μm (ν̃ = 4000-400 cm⁻¹). Synchrotron-based radiation is ∼1000 times brighter than globar sources and gives IR spectra with a much higher signal-to-noise ratio (SNR). It is produced when electrons accelerated at speed by a linear accelerator are forced into a circular motion by bending electromagnets perpendicular to the electron current, causing the electrons to release synchrotron radiation. Synchrotron-based radiation has a broad electromagnetic spectrum, but various wavelength ranges, for example, the mid-IR, can be siphoned off into beamlines. The brightness and energy of synchrotron-based FTIR allow for smaller sampling apertures of 5-10 μm, which is extremely advantageous for biological applications.

Point vs Imaging

There are three approaches toward spectral acquisition with an FTIR spectrometer: (1) from individual points; (2) sequentially across a grid or along a line to produce an image map; and (3) using an array detector to produce an image.14 The first is ideally used for interrogation of many areas across a sample to facilitate a biochemical comparison across samples, for example, epithelial cells in different glandular elements. The image map and imaging techniques are preferentially used to study the tissue morphology of particular areas of interest, perhaps to identify cell types, for example, stem cells (SCs), transient-amplifying (TA) cells or terminally-differentiated (TD) cells in the gastrointestinal crypt. Single-cell spatial resolution is required for image-map analysis, which until recently was only achievable with a synchrotron-based light source. Recent developments of array detectors (grid or linear) have allowed imaging technologies to interrogate specimens at single-cell resolution much faster than single-channel detectors with synchrotron-based radiation.14 Array-detector imaging uses a globar source and an array of detector elements, allowing measurement of large areas simultaneously. Although more information can be acquired in a shorter length of time, this is often at the expense of spectral quality, and there is some debate over the distribution of spectra in relation to the spatial location of the interrogated area and the occurrence of overlapping spectral information. If the aim of the experiment is to distinguish different regions or populations within an image map, then an array detector may be suitable.

Transmittance and Transflection Modes

In transmittance mode, the IR beam is directed through a sample and is then collected by a condenser, whereas in transflection mode it is directed through the sample, reflects off the reflective slide and travels back through the sample to the detector; that is, the IR beam traverses the sample twice.15 Sample thickness is very important for both of these measurements. If the sample is too thick, the IR beam will be attenuated beyond the range where absorption is proportional to chemical concentration; if it is too thin, the absorption will be too low and the acquired spectral signal will be swamped by noise.
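The proportionality between absorption and chemical concentration mentioned above is the Beer-Lambert relation. As a minimal sketch, absorbance can be computed from the transmitted intensity, and the effective path length doubles in transflection mode because the beam crosses the sample twice. The function names and the numerical values of absorptivity, concentration and thickness are illustrative assumptions and do not come from this review.

```python
import math

def absorbance(i_sample: float, i_background: float) -> float:
    """Absorbance A = -log10(I / I0) from sample and background intensities."""
    return -math.log10(i_sample / i_background)

def beer_lambert(absorptivity: float, concentration: float,
                 thickness: float, transflection: bool = False) -> float:
    """Expected absorbance A = absorptivity * concentration * path length.

    In transflection mode the beam traverses the sample twice, so the
    effective path length is twice the sample thickness.
    """
    path = 2.0 * thickness if transflection else thickness
    return absorptivity * concentration * path

# Illustrative (made-up) values: same sample measured in both modes.
a_transmission = beer_lambert(0.01, 1.0, 4.0)
a_transflection = beer_lambert(0.01, 1.0, 4.0, transflection=True)
a_measured = absorbance(10.0, 100.0)  # 10% of the light reaches the detector
```

Under these assumptions the transflection measurement of a given section behaves like a transmission measurement of a section twice as thick, which is one reason the suitable thickness differs between the two modes.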

ATR-FTIR Spectroscopy

ATR-FTIR spectroscopy, typically used with a globar source, utilizes the same principles as FTIR spectroscopy but uses a different technique to transmit the IR to the sample and collect the absorption values.16 A crystal, for example a diamond of size ∼250 μm × 250 μm, is in contact with the sample and the IR beam is passed through the crystal, within which it is totally internally reflected (no refraction occurs); when reflected, an evanescent wave penetrates a few μm beyond the crystal into the sample. The sample in contact with the crystal thus absorbs this IR and attenuates it in a detectable fashion.16

Raman Spectroscopy

Raman spectroscopy is also a robust, noninvasive, high-throughput methodology which is used to produce spectra representative of chemical entities in a sample, and is capable of detecting cellular changes (Figure 2B). Raman spectroscopy detects a wide range of chemical bonds and, unlike mid-IR spectroscopy, is not limited to the detection of polar bonds: the two techniques are considered complementary because many bands that are weak in the IR spectrum are among the strongest bands in the Raman spectrum.2,14 Raman spectroscopy involves measuring how the photons from a monochromatic light source, that is, a laser, interact with the chemical bonds in molecules. Photons are absorbed and released back, causing the vibrational energy of the molecule to increase and subsequently decrease. If the molecule does not return to its previous vibrational energy state, the photon released has a frequency shift which maintains the balance of the system. This shift is known as inelastic light scattering or Raman scattering; it occurs with less than 1% of the photons absorbed by the molecules, and the other 99%+ of emitted photons are due to elastic scattering. The latter is filtered out so that only the Raman scattering reaches the detector, and this is used to produce a spectrum containing peaks corresponding to the wavelengths at which Raman scattering occurred (Figure 2B). Water produces negligible Raman scattering, making this approach feasible for live-cell imaging under aqueous conditions.17 Raman is a diverse spectroscopic technique, as several different technologies are available for various applications. These include spatially offset Raman spectroscopy (SORS), in which detection is offset from the point of laser excitation to allow below-surface measurements to be recorded, for example, subcutaneously in vivo,18 and surface-enhanced Raman spectroscopy (SERS), whereby the signal at the surface is increased by using silver- or gold-labeling, which increases sensitivity to allow single-molecule detection.

Photothermal Microspectroscopy (PTMS)

PTMS uses an FTIR spectrometer, a scanning probe microscope and an atomic force microscope (AFM)-type cantilever probe.15 IR radiation is transmitted onto a sample, causing a temperature rise; the thermal probe is placed a few hundred nanometers above the sample to detect this. The temperature rise causes thermal waves to be generated above the sample, causing the thermal probe to vibrate.15 The vibrations are sensed by reflecting a laser off the back of the probe into a photodiode detector. These data are Fourier-transformed and a true absorbance spectrum is produced.15 As temperature changes are recorded directly from above the sample's surface, optically opaque samples may be analyzed.15 Additionally, the resultant spectra are considered to be “true absorbance spectra” unaffected by dispersion effects. PTMS is a sensitive, noncontact technique, so the risk of contamination is reduced.15

3. COMPUTATIONAL PROCESSES

From the point when absorption is measured and converted into IR spectra, biospectroscopy is essentially a numerical art. Although the legitimacy of sample interrogation through spectroscopy is strongly physically grounded, the resulting measurements present issues that need to be solved through preprocessing; beyond that, answering biological questions translates into applying suitable computational methods. Therefore, an understanding of the numerical techniques applied in the field is increasingly relevant to biospectroscopy researchers and professionals.

Preprocessing

Following spectral acquisition, the spectra must be preprocessed to correct for noise, sloping baseline effects and differences in sample thickness/concentration, and to select the regions of interest. Preprocessing can be generally categorized into cutting, baseline correction, and normalization (Figure 1). Of these, baseline correction is the most challenging.19

The region of interest in the study of biological specimens is 1800-900 cm⁻¹, referred to above as the biochemical-cell fingerprint. This is the region commonly used in FTIR spectroscopy, but it may be widened in Raman spectroscopy due to the extended number of chemical bonds detectable and differences in detection frequencies. The spectral region selected in Raman spectroscopy is ∼2000-500 cm⁻¹.

Wavenumbers known not to be absorbed by a sample should have an absorbance value of zero. However, the IR spectra are often raised above zero in such regions. It is thought that these effects are mainly due to Mie scattering, caused when the wavelength of the IR light is comparable to or smaller than the size of some of the biochemical structures through which it is passing, causing the light to scatter. This results in underlying oscillations throughout the spectral range, so that the absorption measurements do not truly represent the chemical concentrations in the sample.20,21 Other causes of a skewed baseline may be reflection, temperature, concentration, or instrument anomalies. Whatever the cause, these effects may be partially/tentatively removed by a variety of baseline correction techniques. A popular baseline technique, available with Bruker Opus software, is the rubber-band baseline correction, which stretches the spectra down so that the minima of the spectral region of interest are used to fit a convex polygonal line (i.e., the rubber band), which is then subtracted from the original spectrum. Alternatively, one may differentiate the spectra twice (“2nd differentiation”), as this method will cause the spectra to lose their slope (1st-order polynomial fit); however, after differentiation the spectra lose their original shape. Also, the only appropriate normalization technique following differentiation is vector normalization (see below). Mie scattering may be treated explicitly by newer techniques that are becoming increasingly popular.20,21 Such techniques seek to fit the spectral patterns of the Mie scattering contaminant to the measured spectrum so that these patterns can subsequently be subtracted. However, applying this technique may cause horizontal peak shifts in certain regions of the spectra; this is controversial, as protein conformation may also cause a shift in spectra, which has been observed in particular with Amide I.

Normalization is employed to scale the spectra and remove spectral changes attributable to the thickness or concentration of the sample, thus making all spectra in a batch comparable to each other. Min-max normalization may be applied when a known peak is stable and consistent across the specimens; commonly, Amide I is used for this purpose. Alternatively, vector normalization may be applied, whereby each spectrum is divided by its Euclidean norm, therefore not relying on one peak.

Feature Extraction
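Before any features are extracted, the preprocessing chain described in the preceding subsection (cutting, baseline correction, normalization) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation used by Bruker Opus or by the authors: the rubber band is taken to be the lower convex hull of the spectrum, min-max normalization is taken to mean scaling to the height of the Amide I band near 1650 cm⁻¹, and the spectrum itself is synthetic. All function names are hypothetical.

```python
import numpy as np

def cut_region(wn, spectra, low=900.0, high=1800.0):
    """Keep only the columns whose wavenumber lies within [low, high]."""
    keep = (wn >= low) & (wn <= high)
    return wn[keep], spectra[:, keep]

def rubberband_baseline(wn, y):
    """Lower convex hull of one spectrum, interpolated at every wavenumber.

    Assumes wn is sorted in ascending order.
    """
    hull = []  # indices of the hull vertices found so far
    for i in range(len(wn)):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = (wn[b] - wn[a]) * (y[i] - y[a]) - (y[b] - y[a]) * (wn[i] - wn[a])
            if cross <= 0:  # point b lies on or above the chord a -> i
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(wn, wn[hull], y[hull])

def second_derivative(wn, y):
    """Differentiate twice; constant and linear baseline terms vanish."""
    return np.gradient(np.gradient(y, wn), wn)

def amide_i_normalize(wn, y, center=1650.0, half_width=30.0):
    """Min-max style scaling: minimum to 0, Amide I band maximum to 1."""
    shifted = y - y.min()
    near = np.abs(wn - center) <= half_width
    return shifted / shifted[near].max()

def vector_normalize(y):
    """Divide the spectrum by its Euclidean norm."""
    return y / np.linalg.norm(y)

# Synthetic example: one Gaussian band at 1650 cm^-1 on a sloping baseline.
wn_full = np.arange(700.0, 2001.0, 2.0)
band = np.exp(-0.5 * ((wn_full - 1650.0) / 20.0) ** 2)
raw = (band + 0.0002 * wn_full + 0.1)[None, :]  # shape (1 spectrum, n wavenumbers)

wn, cut = cut_region(wn_full, raw)
corrected = cut[0] - rubberband_baseline(wn, cut[0])
by_peak = amide_i_normalize(wn, corrected)
by_norm = vector_normalize(corrected)
slope_only = second_derivative(wn, 0.0002 * wn + 0.1)
```

In this sketch the rubber band removes the sloping offset while leaving the band shape intact, whereas the second derivative removes the slope but changes the shape of the spectrum, which is the trade-off noted in the text.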

In accordance with statistical terminology, “features” is a synonym for “input variables”, that is, the inputs to the subsequent analysis method. In a number of cases, the wavenumber absorbance intensities can be used as features, but more often than not a prior step is required to reduce the number of variables in order to avoid the so-called “curse of dimensionality”.19 This “curse” can manifest as overfitting, which can lead to poor performance of classifiers when tested on independent data, or to wrong hypotheses being drawn from exploratory analysis.19 Generally, it is recommended that the number of spectra be several times (at least 5-10 times) greater than the number of variables/features in the data set.19,22,23 The number of variables can be reduced either by feature construction or by feature selection.23

Feature Construction

These techniques build new variables from the wavenumbers by combining them in different ways.19 Common techniques for this purpose are linear techniques such as principal component analysis (PCA), linear discriminant analysis (LDA), PCA-LDA, or partial least-squares (PLS), where the constructed variables are linear combinations of the wavenumbers (see below). Feature construction can also be embedded into computational methods, such as the support vector machine (SVM).24 Other examples include peak detection, where the features are the horizontal peak locations, wavelet transformation,25 and classifier-specific techniques.26
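As an illustration of peak detection as feature construction, the sketch below reduces a spectrum to the wavenumbers of its local maxima. It is a deliberately simple local-maximum rule chosen for illustration; the function name, the height threshold and the synthetic two-band spectrum are assumptions, and practical peak pickers usually add smoothing and prominence criteria.

```python
import numpy as np

def peak_locations(wn, y, min_height=0.0):
    """Wavenumbers of local maxima whose intensity exceeds min_height."""
    inner = np.arange(1, len(y) - 1)
    is_peak = (
        (y[inner] > y[inner - 1])
        & (y[inner] > y[inner + 1])
        & (y[inner] > min_height)
    )
    return wn[inner[is_peak]]

# Synthetic spectrum with bands placed at the Amide I and Amide II positions.
wn = np.arange(900.0, 1801.0, 2.0)
spectrum = (
    1.0 * np.exp(-0.5 * ((wn - 1650.0) / 15.0) ** 2)
    + 0.6 * np.exp(-0.5 * ((wn - 1550.0) / 15.0) ** 2)
)
peaks = peak_locations(wn, spectrum, min_height=0.1)
```

Here 451 absorbance values are replaced by two constructed features, the horizontal positions of the two bands.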


Feature Selection

Feature selection consists of retaining only some of the wavenumbers according to a selection criterion.23 However, it is practically impossible to find the best subset from among all 2ⁿ possibilities, where n is the number of wavenumbers originally present in the data set. Therefore, several suboptimal feature selection strategies exist, the majority of which can be divided into “filter methods” and “wrapper methods”.23,27 Filter methods rank the relevance of the wavenumbers individually based on an evaluation criterion, for example, t-test, Pearson correlation with the class, F-test or univariate classification. This ranking is then used to decide which wavenumbers will be retained. However, filter methods do not take into account statistical dependencies between wavenumbers, because only one wavenumber is assessed at a time. Wrapper methods, on the other hand, consider subsets of wavenumbers in the selection process. In this process, the selection algorithm is “wrapped around” a classifier that is used to rank the subsets. Because an exhaustive search among all 2ⁿ subsets is infeasible, wrapper methods need to adopt a search strategy. Such strategies include forward search, backward search, forward-backward search, genetic algorithms and the branch-and-bound algorithm.19,23,27
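The difference between the two families can be sketched in a few lines. In this hedged example the filter criterion is the absolute Welch t-statistic of each variable, and the wrapper is a greedy forward search around a nearest-centroid classifier. Both choices, the function names and the synthetic two-class data are assumptions made for illustration; the wrapper is scored on resubstitution accuracy only for brevity, where cross-validation would normally be preferred.

```python
import numpy as np

def t_scores(X, y):
    """Absolute Welch t-statistic of each column for a two-class problem."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / se

def filter_select(X, y, k):
    """Filter method: rank columns one at a time and keep the top k."""
    return np.argsort(t_scores(X, y))[::-1][:k]

def centroid_accuracy(X, y):
    """Resubstitution accuracy of a nearest-centroid classifier."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[dist.argmin(axis=1)] == y).mean())

def forward_select(X, y, k):
    """Wrapper method: greedy forward search scored by the classifier."""
    chosen = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(remaining, key=lambda j: centroid_accuracy(X[:, chosen + [j]], y))
        chosen.append(best)
    return chosen

# Synthetic data: 40 "spectra", 6 variables, only variable 2 differs by class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 6))
X[y == 1, 2] += 5.0

top_by_filter = filter_select(X, y, 1)
top_by_wrapper = forward_select(X, y, 2)
```

The forward search evaluates on the order of n·k subsets instead of all 2ⁿ, which is what makes the wrapper tractable at the price of a suboptimal answer.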

4. COMPUTATIONAL ANALYSIS

Principal Component Analysis (PCA)

PCA is an unsupervised technique employed to reduce dimensionality and generate a visualization of data.7,28 It is a linear transformation of the wavenumber data set operated by the PCA loadings matrix. The loadings vectors (commonly called principal components [PCs]: PC1, PC2, PC3, etc.) within this matrix are eigenvectors of the covariance matrix of the data (Figure 3). Each loadings vector contains the coefficients of a linear combination that generates one new variable called a PCA factor. The PCs are orthogonal to each other (i.e., each loadings vector forms an angle of 90° with all the other loadings vectors), thus the PCA factors are uncorrelated. Each PC has a corresponding eigenvalue which exactly matches the variance of its corresponding PCA factor, enabling these factors to be ranked according to the magnitude of variance captured by each one.29 The number of PCs is less than or equal to the number of samples or variables, whichever is smaller. Each spectrum may be plotted as one point in 3D, using three PCs of choice and plotting the spectra in the linearly transformed PC space. The first 3 PCs are commonly used as they contain the most variance, often up to 99%, ensuring optimum visualization of the spread of the data.29

The consequences of PCA being unsupervised are 2-fold. On the one hand, such a technique is unbiased in the sense that no information about the data classes is inputted into the algorithm, and therefore the derived scores, loadings and cluster vectors will represent true variability within the data (Figure 3), making PCA somewhat similar to clustering when used as exploratory analysis. On the other hand, PCA will separate data classes only if variability is primarily associated with such classes, which in our experience is not true for a number of spectroscopic data sets.

Partial Least-Squares (PLS)
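Before turning to the supervised construction, the PCA properties stated above can be checked numerically. The sketch below is a minimal NumPy version that follows the description literally (eigenvectors of the covariance matrix); the function name and the synthetic data are assumptions, and library implementations typically use a singular value decomposition instead for numerical reasons.

```python
import numpy as np

def pca(X, n_components=3):
    """Return PCA factors (scores), loadings and factor variances."""
    Xc = X - X.mean(axis=0)                 # mean-center every wavenumber
    cov = np.cov(Xc, rowvar=False)          # covariance between wavenumbers
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigen-decomposition
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order]            # one column per PC
    scores = Xc @ loadings                  # linear combinations of wavenumbers
    return scores, loadings, eigvals[order]

# Synthetic "spectra": 50 samples, 20 wavenumbers, 8 underlying sources.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 20))

scores, loadings, variances = pca(X, n_components=3)
explained = variances / np.trace(np.cov(X, rowvar=False))
score_cov = np.cov(scores, rowvar=False)
```

In this example `score_cov` is diagonal (the factors are uncorrelated), its diagonal equals the eigenvalues, and `explained` gives the fraction of total variance captured by each of the three PCs used for a 3D scores plot.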

PLS constructs a set of linear combinations of the wavenumbers in the same way as PCA, but uses the data classes in the construction (therefore PLS is a supervised technique).7 Where PCA ranks the PCs according to variance within the data, PLS employs a different approach. It finds a sequence of new variables that are maximally correlated with a numerical representation of the data classes while being orthogonal (uncorrelated) to each other.26,30 One disadvantage of PLS is the complicated mathematics involved in relating wavenumbers to data classes for complete interpretation; in addition, PLS has a tendency to overfit and requires more validation than PCA.31
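One common way of building such variables is the NIPALS form of PLS with a single response (PLS1), sketched below. This is offered as an assumption about a typical variant rather than as the algorithm used in the cited studies: each weight vector is proportional to Xᵀy, which maximizes the covariance between the new variable and the numerically coded classes, and the data are deflated after each component so that successive scores are orthogonal. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def pls1(X, y, n_components=2):
    """NIPALS PLS1: scores and weights for a numerically coded class vector."""
    Xr = X - X.mean(axis=0)
    yr = y - y.mean()
    scores, weights = [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)      # direction of maximal covariance with y
        t = Xr @ w                     # new variable (score)
        p = Xr.T @ t / (t @ t)         # X loadings for this component
        q = (yr @ t) / (t @ t)         # regression of y on the score
        Xr = Xr - np.outer(t, p)       # deflate X so later scores are orthogonal
        yr = yr - q * t                # deflate y
        scores.append(t)
        weights.append(w)
    return np.column_stack(scores), np.column_stack(weights)

# Synthetic two-class data: classes coded 0.0 and 1.0, three informative variables.
rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0], 20)
X = rng.normal(size=(40, 10))
X[y == 1.0, :3] += 2.0

T, W = pls1(X, y, n_components=2)
first_score_corr = np.corrcoef(T[:, 0], y)[0, 1]
```

Because the class labels steer the construction, the first score separates the classes even when the class difference is not the largest source of variance, which is also why PLS needs more careful validation than PCA.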

Linear Discriminant Analysis (LDA)

LDA is a supervised technique which forms linear combinations of variables dependent on differences between the classes in the data set.30,32 Formally, the LDA loadings vectors are successive orthogonal solutions to the problem of “maximize the between-class variance over the within-class variance”.26 After LDA, the data set will have only c - 1 variables, where c is the number of data classes. The Mahalanobis distances between a point and the class centers in the wavenumber-variable space are equivalent to the corresponding Euclidean distances in the LDA-transformed space. LDA and Mahalanobis distance are techniques that account for the ellipsoidal shapes of data clusters.30

PCA-LDA

This technique is the cascade application of LDA to the factors resulting from PCA. An important consideration when applying LDA following PCA is how many PCs to include; too few results in a lack of information, while too many increases the amount of noise in the data and may allow LDA to overfit.31 Using 10 factors is a compromise between the two situations; further, it has been shown that more than 99% of variance is captured in the first 10 PCs, and that using more than 20 PCs includes too much noise,30 degrading the interpretation of the loadings plots.31

Visualization: Scores and Loadings Plots

Resulting scores and loadings plots provide a visual representation and interpretation of the variables responsible for any segregation.32 Following the construction of factors by any of the techniques above, the factor values (i.e., factor scores) can be used as Cartesian coordinates for 1-, 2-, or 3-D scatter plots (scores plots). Furthermore, because the loadings vectors of each technique have the same resolution as the original spectra, their coefficients can be plotted against the wavenumber axis to reveal the contribution of each wavenumber to each corresponding factor. Loadings are intended to provide class separation and compactness; however, the loadings also identify distinguishing wavenumbers. Wavenumbers specific to categories may be extracted from the loadings plots, translated into their corresponding biochemicals and identified as potential biomarkers of specific categories. Alternatively, if a wavenumber consistently appears in the loadings, one may refer to the spectra to identify how its absorbance intensity changes across all categories: this could indicate a biomarker of a progression pattern across categories, for example, disease progression.

Cluster Vector Approach

This is a geometric construction first proposed in German et al. (2006) which can be applied to any of the linear techniques described above, although it is more often employed following PCA-LDA.33,34 The idea of cluster vectors follows from the fact that loadings vectors are found to be more informative when they “pass through” data points instead of pointing toward void space (Figures 3 and 4). Thus, there is one cluster vector for each data class. Each cluster vector is the vector that points from the origin to the center of its corresponding data class in the vector space spanned by the loadings vectors. The center of a given class is calculated by taking the average of all points belonging to the class. This approach is particularly useful when there are many classes and interpretation of the scores and loadings plots is both complex and subjective. Recently, a different construction was proposed where, instead of departing from the origin of the space in question, the cluster vectors start at the center of a chosen reference class (e.g., “control” or “normal”).35

Cluster Analysis

Clustering techniques are unsupervised methods that aim to group the spectra into classes.7 Clustering is applied when information about data classes is not available, or when one wishes to see the patterns that emerge from the data set without using the data classes. The latter makes sense in at least two different situations: (1) when it is of interest to know whether the classes assigned to the spectra by clustering agree with the a priori class labels;14 or (2) to facilitate manual entry of training data classes in image classification studies, so that the user only needs to assign labels to a few groups instead of to thousands of spectra.

Hierarchical Clustering
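As a concrete illustration of the agglomerative form discussed in this subsection, the following is a minimal sketch in plain Python. It assumes Euclidean distance as the similarity measure and single linkage (cluster distance taken as the distance between the closest pair of members); neither choice is specified in the text, and the function names are hypothetical. The recorded merge list is the information a dendrogram would display.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length spectra."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerative(spectra, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns (clusters, merges): clusters hold indices into `spectra`;
    merges records (members_a, members_b, distance) for each merge.
    """
    clusters = [[i] for i in range(len(spectra))]  # every spectrum starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (assumed): closest pair across the two clusters.
                d = min(
                    euclidean(spectra[i], spectra[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges
```

Stopping the loop at a chosen `n_clusters` corresponds to cutting the dendrogram at one level; the quadratic search over cluster pairs at every merge also hints at why the method scales poorly to large data sets.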

Hierarchical clustering is a partitioning algorithm, and two forms exist: agglomerative and divisive. Agglomerative clustering forms clusters by merging, two at a time, the data points that are most similar; this is continued until all of the data points belong to a cluster.36 Agglomerative clustering is less suited to large data sets because of the immense computational requirements of the method.36 Divisive clustering works in the opposite way, starting out with all of the data belonging to one cluster. Then, based on similarity measures, the cluster divides into two; this step is repeated until every data point is its own cluster. Hierarchical clustering is visualized by a “dendrogram”: agglomerative and divisive clustering on the same data set should produce the same diagram, except that they will be mirror images of one another. To find the clusters, the algorithm can be stopped at any point and the clusters obtained; however, this does not in itself validate the clusters or give confidence that the clustering is correct.

K-Means Clustering (KM)
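To make the KM procedure outlined in this subsection concrete (choose centers, measure distances, assign to the nearest center, repeat), the following is a minimal sketch in plain Python. It assumes squared Euclidean distance and takes the initial centers as an explicit argument, since the result depends on the initial solution; the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length spectra."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(spectra, initial_centers, max_iter=100):
    """Return (labels, centers) for spectra given as lists of floats."""
    centers = [list(c) for c in initial_centers]  # step 1: starting centers
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: measure distances and assign to the nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(s, centers[j]))
            for s in spectra
        ]
        if new_labels == labels:  # step 4: stop once nothing is reassigned
            break
        labels = new_labels
        # Recompute each center as the mean of its current members.
        for j in range(len(centers)):
            members = [s for s, lab in zip(spectra, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Running the function with different `initial_centers` on the same data is a simple way to observe the sensitivity to initialization.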

The KM algorithm is well-known for being robust and is a further clustering technique that starts from previously given class membership values and, therefore, a fixed number of clusters.37 The downside of this methodology is that it is dependent on the initial solution and on the order in which the data are input into the algorithm.37 KM is particularly useful when handling large data sets, as it uses differences in the data to organize them into clusters. Two developments of KM have also been produced: k-modes, which replaces the means of clusters by modes, and k-prototypes, which is used to cluster mixed-type objects. Fuzzy k-means has been used to monitor gene expression in yeast and to try to overcome the complexity of coregulated genes. A simple explanation of the method: (1) determine the centers of the clusters; (2) determine the distance between all data and the centers, one center at a time; (3) assign data to clusters based on minimum distances; and (4) repeat steps 1-3 until no more data are reassigned.37

Fuzzy C-Means Clustering (FCM)
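The role of memberships and of the fuzzifier in FCM, as discussed in this subsection, can be illustrated with a short sketch. It assumes the commonly used FCM update rules (memberships from ratios of Euclidean distances, centers as membership-weighted means) and a fuzzifier `m` of 2; the text does not give these formulae, and the function names are hypothetical.

```python
def fcm_memberships(spectra, centers, m=2.0):
    """Return u[i][j], the membership of spectrum i in cluster j (each row sums to 1)."""
    power = 2.0 / (m - 1.0)  # m > 1 is the fuzzifier
    memberships = []
    for s in spectra:
        dists = [sum((a - b) ** 2 for a, b in zip(s, c)) ** 0.5 for c in centers]
        if any(d == 0.0 for d in dists):
            # A spectrum lying exactly on a center belongs fully to that center.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            row = [h / sum(hits) for h in hits]
        else:
            row = [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]
        memberships.append(row)
    return memberships


def fcm_centers(spectra, memberships, m=2.0):
    """Recompute each center as the mean of all spectra weighted by membership**m."""
    centers = []
    for j in range(len(memberships[0])):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centers.append(
            [sum(w * x for w, x in zip(weights, col)) / total for col in zip(*spectra)]
        )
    return centers
```

Alternating the two functions until the memberships stop changing gives the full iteration; larger values of `m` push memberships toward equal shares across clusters, while values close to 1 approach the hard assignments of KM.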

FCM uses a membership function that assigns each data point a value from 0 to 1 reflecting its degree of similarity or “membership” to each cluster.38 For a data point such as a spectrum, 0 indicates minimal membership to a cluster while 1 indicates maximum membership. The results may be displayed as images of the samples, pseudocolored for intensity according to membership values.38 FCM is very similar to KM except that it has an additional component known as a “fuzzifier”, which controls the amount of fuzziness each data point has to a cluster.38 This fuzzifier can affect the stability of cluster assignments; therefore, it is often set to a value that does not destabilize them.39

dx.doi.org/10.1021/pr101067u |J. Proteome Res. 2011, 10, 1437–1448

Figure 3. Schematic of PCA and PCA-LDA. Various output options available are provided at each stage, and accompanying formulae. In the matrix representation, one can see the reduction of the number of columns of the data matrix operated by the successive application of PCA and LDA. The numbers of variables 235 and 10 are provided as examples only to give an idea of variable reduction. LPCA, loadings matrix that operates the PCA transformation; LLDA, loadings matrix that operates the LDA transformation. Scores, cluster vectors and loadings of the visualization of PCA are graphically represented.

Figure 4. Interpretation of PCA-LDA. PCA-LDA and subsequent interpretation including an example of how results may produce biomarkers or reveal mechanistic information.

Classifiers

Classifiers are models that learn from training data sets with the aim of subsequently predicting the class of spectra whose classes are unknown. The algorithms are borrowed from the field of machine learning, and there are perhaps thousands of different models and variants. Thus, choosing which classifier to use in a biospectroscopy scheme is far from straightforward. This decision usually involves a trade-off among the following factors: (1) classifier efficiency, that is, the ability to correctly predict the class of unknown samples; (2) computational time; and (3) model interpretability.40 Classifier efficiency generally depends on many factors, including: (1) the statistical distributions of the spectra within each data class for the particular data set in hand; (2) the number of variables; and (3) the number of spectra available for training the model. Usually, little is known a priori about the data distributions, so different classifiers need to be tested.26,39 More variables mean more information, but redundant variables also add noise;19 finally, the “minimum necessary” number of spectra to reliably train a classifier depends on model complexity, with more complex models requiring more spectra because they have more degrees of freedom. Computational time may be a constraint that biases the choice toward a simpler model even if its accuracy is poorer than that of a more complex one. Interpretability is particularly interesting in biospectroscopy, as it means relating the wavenumbers important for classification to biochemical components in the samples. Efficiency is commonly measured as the classification rate, defined as the percentage of correct predictions over a set of unknown-class test spectra.40

eClass

eClass is an evolving fuzzy classifier. In the training phase, fuzzy rules are formed or deleted according to the algorithm heuristics, which are designed to provide generality, space coverage and simplicity (the latter meaning as few rules as possible). eClass learns incrementally, that is, spectra are read one by one to evolve the model structure and adapt its parameters. Rules are formed from spectra chosen to be “prototypes”; more than one rule can exist per class, and rules may identify specific wavenumbers of interest, as well as those reliable in the prediction process.41,42 A fuzzy rule consists of linguistic “IF-THEN” statements of the general form “IF spectrum is...THEN class is...”. eClass also allows for embedded online feature selection. Other options include controlling the number of rules produced and the learning rate. In the testing stage, an unknown spectrum is compared to the rules, producing a “firing level” value for each rule. Firing levels are combined at the output to produce an overall conclusion about which class the spectrum belongs to. An interesting feature of eClass is that the “IF-THEN” rules are human-readable.43 eClass classification time is proportional to the number of rules, which tends to stabilize after a certain number of spectra have been fed into training.

k-Nearest Neighbors (k-NN)
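The two phases of k-NN discussed in this subsection (store the training set, then vote among the k closest spectra) can be sketched in a few lines of plain Python. The sketch assumes squared Euclidean distance as the closeness measure and a default of k = 3, neither of which is fixed by the text; the function name is hypothetical.

```python
from collections import Counter


def knn_predict(train_spectra, train_labels, x, k=3):
    """Classify spectrum x by majority vote among its k nearest training spectra."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # "Training" is only storage: the labeled spectra are ranked at query time.
    ranked = sorted(
        zip(train_spectra, train_labels), key=lambda pair: sq_dist(pair[0], x)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because every prediction sorts the whole training set, classification time grows with the number of stored spectra, and no model structure is produced that could be inspected afterwards.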

The training phase of k-NN consists of simply storing the training data set. In the classification phase, given an unknown spectrum x, the algorithm finds the k spectra, inside the training data set, closest to x, and classifies x according to the majority vote among these k spectra. Because k-NN has no model structure, the relationship between the features and the class output cannot be analyzed.26

Artificial Neural Network (ANN)

The ANNs commonly used in the analysis of spectroscopic data are of the multilayer perceptron type, which are suitable for pattern classification (as opposed to other ANN models used in clustering or regression). Each perceptron is a nonlinear model whose inputs are the outputs of the previous layer. Performance of an ANN is heavily dependent on its structure. The number of layers and number of neurons in each layer need to be determined for each particular problem by an optimization procedure (often trial-and-error). ANN models are usually used as black boxes as there is no easy interpretation of their structures to determine the link between inputs and class outcomes.26 ANN classification times are fixed and dependent only on the model structure and not on the training size.

Support Vector Machines (SVM)

The SVM classifier is an aggregation of 2-class support vector classifiers, and the final class is decided by majority voting. In training mode, each single 2-class classifier tries to find two parallel hyperplanes, forming a wall, between which no training data point exists, and at either side of which lie all the points from each class. The distance between these two planes is maximized in training, reducing the misclassification rate. The data points that touch the wall at either side are called support vectors. SVMs were originally linear classifiers; however, the subsequent application of a mathematical formulation called the kernel trick to SVMs gave rise to nonlinear classifiers.30 SVM classification times increase proportionally to the number of samples in the training set.
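The aggregation of 2-class classifiers by majority voting described above can be sketched as follows. This is only the prediction side, under stated assumptions: each pairwise classifier is linear and already trained, represented by a hypothetical weight vector `w` and offset `b`, and the wall between the two hyperplanes w·x + b = ±1 has width 2/‖w‖ (the quantity maximized in training). The names and the toy weights in the usage note are illustrative, not taken from any study cited here.

```python
from collections import Counter


def linear_decision(w, b, x):
    """Signed score of spectrum x for one 2-class linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / sum(wi * wi for wi in w) ** 0.5


def one_vs_one_predict(pairwise, x):
    """Majority vote over pairwise classifiers.

    `pairwise` maps (positive_class, negative_class) -> (w, b); all values
    are assumed to come from an earlier training step not shown here.
    """
    votes = Counter()
    for (pos, neg), (w, b) in pairwise.items():
        votes[pos if linear_decision(w, b, x) >= 0 else neg] += 1
    return votes.most_common(1)[0][0]
```

For example, with three classes and the toy classifiers `{("A", "B"): ([-1.0, 0.0], 2.0), ("A", "C"): ([0.0, -1.0], 2.0), ("B", "C"): ([1.0, -1.0], 0.0)}`, a spectrum is assigned to whichever class wins at least two of the three pairwise decisions.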


5. COMBINING BIOSPECTROSCOPY WITH COMPUTATIONAL ANALYSIS

Several combined techniques offer the prospect of disease characterization and detection. In this section, we review biospectroscopy-based techniques and studies that have been coupled with multivariate analysis for the identification and increased understanding of disease progression.

Exploratory Analysis

Exploratory analysis may facilitate identification of specific regions of wavenumbers (or single wavenumbers) related to a difference between classes, that is, biomarker extraction.44 Mechanistic information may also be identified, for example, a progression pattern of classes in scores plots may indicate stages in a process, while loadings/cluster vectors may reveal biochemical differences, for example, SC versus TA cells versus TD cells.45,46 Exploratory analysis employed in the investigation of cervical cancer has shown that ATR-FTIR spectroscopy has potential as a diagnostic tool to complement the current screening program.47-49 PCA-LDA was able to distinguish spectra acquired from benign and precancerous specimens, facilitating the prospect of a predictive tool.50 Identification of SCs, TA cells and TD cells in the cornea and gastrointestinal tract has been suggested by synchrotron-based FTIR microspectroscopy coupled with PCA or PCA-LDA.45,51,52 In the gastrointestinal tract, 1080 cm-1 (νsPO2-) has been suggested as a SC marker.45,46 Furthermore, the differentiation progression in the gastrointestinal tract could be tracked following computational analysis (PCA or PCA-LDA) of the acquired spectra in large bowel and small intestine.45 Synchrotron-based FTIR microspectroscopy coupled with PCA-LDA also facilitated the identification of an underlying biochemical difference between endometrial carcinoma patients within each carcinoma subtype.53 The use of synchrotron-based FTIR microspectroscopy allowed high spatial-resolving power to distinguish single cells. 
Additional applications of PCA to spectroscopy data include biochemical profiling of bacteria, differentiation of drug-resistant and nonresistant human melanoma cell lines and understanding composition of oocytes.55,56 A recent protocol provides information on how to handle biological material with a view to spectroscopy followed by multivariate analysis.57 The identification of cells and the spectral changes detected as a result of solvent-induced chemical changes suggest Raman and FTIR spectroscopy coupled with PCA as novel tools in drug delivery assessments. Further applications of PCA or PCA-LDA to spectroscopic data include the identification of nonmelanoma skin cancer and malignant gliomas.58-60 Additionally, ATR-FTIR spectroscopic data analyzed by PCA-LDA has been used to detect differences between human breast cancer cell lines in different phases of the cell cycle and the response following the treatment of single or binary mixtures of chemicals, at environmental exposure levels.35,35 The application of Raman spectroscopy analyzed by PCA on tissue at different stages of esophageal carcinoma has shown that biochemical changes in the cells are detectable prior to morphological changes.61 This study showed the potential of Raman to be used in vivo and the strong diagnostic capabilities of early detection of high-risk patients.61 Raman is also able to distinguish between apoptotic-induced cells and gastric carcinoma when analyzed with PCA.62 PTMS is a relatively new spectroscopic tool and not commercially available; all of the studies using this technique thus far used exploratory analysis, for example, cell cycle monitoring,63 and corneal SC identification and extraction of associated biomarkers.51,64

Cluster Analysis and Imaging

Cluster analysis is particularly useful for extracting morphological information from large image maps/images and for the identification of classes within the data, especially where little is known about the samples. In particular, cell-type identification in human colorectal samples was achieved with FTIR spectroscopy coupled with HCA; once class labels have been established in this way, a classifier, for example, ANN, can be used.65 Currently, IR imaging techniques are commonly used to identify cells within tissues for the study of tissue architecture and identification of cancerous cells.66,67 It is possible to identify the distribution of chemical components across a sample; relative to molecular techniques, image analysis could be considerably more efficient. Once wavenumbers have been identified, the distribution of the biochemicals related to the identified biomarker may be viewed across any image, for example, live cancer cells.44,68 Image analysis has found low levels of particular nucleic acids and lipids to be associated with pancreatic cancer tissue.69 A combination of FTIR spectroscopy and HCA detected malignancy in peritoneal tissue, which was associated mainly with the 1300-905 cm-1 region.70 A further application of FTIR microspectroscopy uses synchrotron-based mapping and IR imaging to identify three regions of a mouse blastocyst: secondary protein conformation and lipid content were found to contribute the most to the differences between the regions, identified by HCA and PCA.71 Raman spectra were acquired using imaging technology from variously prepared single cancer cells and compared with those from living cells;72 using HCA and PCA, it was shown that formalin-fixation and cytocentrifugation have minimal effect on cell biochemistry.
Pseudocolor images of HCA clusters allowed biological comparisons across images as well as providing an indication of the similarity of spectra between sample preparation classes.71 One PTMS study aimed at determining if the spectral information obtained was distinguishable based on the stage of the cell cycle: HCA performed on the data could distinguish between these cell populations and the effect of lindane treatment.63

Classification

The current cervical screening program could potentially be supported by ATR-FTIR spectroscopy coupled to eClass for the identification of normal smear results distinct from precancerous lesions.50 The prediction of an unknown sample based on the knowledge acquired from known samples enables IR spectroscopy to be used as a novel diagnostic tool in a clinical setting. Here, eClass predicts the class membership of each spectrum within an unknown cervical data set, using the training set of labeled cervical data. Although this methodology has proved biologically useful in the classification of cervical cancer, it could be applied and tested with other data sets. Additionally, through interpretation of the membership of each spectrum, additional information on the progression/regression of atypical regions can be explored. The adaptive characteristic of eClass allows the addition of new data without running the algorithm from the beginning, whereas the majority of alternative classifiers would run the data through the algorithm from the beginning. Applications of eClass include atmospheric monitoring of nitrate levels, cervical cancer grade predictions using ATR-FTIR spectroscopy, and robotics.43 Diseases such as Alzheimer’s have been studied in an attempt to discover vital information on disease progression using FTIR spectroscopy and ANN.73 Extensive research with FTIR microspectroscopy coupled with ANN has led to the identification of differences between the protein structure of PrPc, the cellular prion protein commonly produced by neurons, and PrPSc, the disease prion protein found in the central nervous system of humans and animals with transmissible spongiform encephalopathy. These changes failed to be detected by either a PET blot or a Western blot.13 IR imaging data of axillary lymph nodes of breast cancer patients, vitally removed and tested for an indication of disease recurrence, was processed into clusters by HCA, following which the data was subdivided into normal or malignant classes and trained with ANN.74 This work highlights the use of spectral imaging for accurate diagnosis when coupled with appropriate and robust computational methodologies. In a study using Raman spectroscopy in pursuit of a diagnostic tool for colon cancer, SVM was applied following data reduction by PCA.75 Raman spectroscopy coupled with SVM has potential in cancer diagnosis of the colon. Furthermore, SVM has been used for the classification of IR spectra in cervical cancer screening, with specificity and sensitivity higher than the current screening tests.76

6. CONCLUSIONS AND OUTLOOK

As nondestructive tools, these combined techniques offer a powerful complementary methodology to conventional cancer screening tests, particularly as in vivo testing is possible, making preclinical diagnosis a real possibility. Biospectroscopy presents a unique opportunity to consider an interrogated sample as a whole, on a biochemical level, as it acquires biochemical-cell fingerprints containing information from all components of the sample. Differences in spectra among data classes are often visible, but subtle differences could be missed or interpretation flawed from unidentified spectral variation; in general, the data produced from spectroscopic techniques are too complex and large to be analyzed “by eye”. While examining the various data collection techniques of spectroscopy and the computational analysis available, it becomes apparent that a suitable combination of both approaches would be the most effective way to interpret the spectra while being unbiased, reliable, quick and robust (Figure 1). Thus, the field of biospectroscopy is clearly multidisciplinary, with knowledge of biological, physical and statistical aspects of the multistep process being equally important for successful experimental design and reliable results, so much so that the term “biospectroscopy” may be regarded as encompassing sample preparation, instrumentation and computational analysis alike, with all the details within. Biospectroscopy is a novel and powerful approach to characterization of samples based on their biochemical-cell fingerprints. Multivariate analysis facilitates biomarker extraction, sample clustering and classification from spectral data sets. The extraction of biomarkers is vital for the unbiased interpretation of the inter- and intraclass variance, which may lead to disease characterization and diagnosis.

AUTHOR INFORMATION

Corresponding Author

*Tel.: +44 1524 510206. E-mail: [email protected].

ACKNOWLEDGMENT

This work was funded by the Rosemere Cancer Foundation and Unilever.

ABBREVIATIONS

ANN, artificial neural networks; ATR-FTIR spectroscopy, attenuated total reflection Fourier-transform infrared spectroscopy; eClass, evolving classifier; FT, Fourier-transform; FCM, fuzzy c-means; HCA, hierarchical cluster analysis; IR, infrared; KM, k-means clustering; k-NN, k-nearest neighbor; LDA, linear discriminant analysis; PLS, partial least-squares; PTMS, photothermal microspectroscopy; PCA, principal component analysis; SNR, signal-to-noise ratio; SC, stem cell; SVM, support vector machines; TD cells, terminally differentiated cells; TA cells, transient-amplifying cells

REFERENCES

(1) Heraud, P.; Tobin, M. J. The emergence of biospectroscopy in stem cell research. Stem Cell Res. 2009, 3, 12–14.
(2) Stuart, B. Infrared spectroscopy: fundamentals and applications, 1st ed.; John Wiley and Sons Inc.: Chichester, England, 2005.
(3) Griffiths, P. R.; de Haseth, J. A. Fourier Transform Infrared spectroscopy, 2nd ed.; John Wiley and Sons, Inc.: Hoboken, NJ, 2007.
(4) Martin, F. L.; Patel, I. I.; Sozeri, O.; Singh, P. B.; Ragavan, N.; Nicholson, C. M.; Frei, E.; Meinl, W.; Glatt, H.; Phillips, D. H.; Arlt, V. M. Constitutive expression of bioactivating enzymes in normal human prostate suggests a capability to activate pro-carcinogens to DNA-damaging metabolites. Prostate 2010, 70, 1586–1599.
(5) Patel, I. I.; Martin, F. L. Discrimination of zone-specific spectral signatures in normal human prostate using Raman spectroscopy. Analyst 2010, 135, 3060–3069.
(6) Harvey, T. J.; Gazi, E.; Henderson, A.; Snook, R. D.; Clarke, N. W.; Brown, M.; Gardner, P. Factors influencing the discrimination and classification of prostate cancer cell lines by FTIR microspectroscopy. Analyst 2009, 134, 1083–1091.
(7) Wang, L.; Mizaikoff, B. Application of multivariate data-analysis techniques to biomedical diagnostics based on mid-infrared spectroscopy. Anal. Bioanal. Chem. 2008, 391, 1641–1654.
(8) van den Berg, R. A.; Hoefsloot, H. C. J.; Westerhuis, J. A.; Smilde, A. K.; van der Werf, M. J. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 2006, 7, 142–157.
(9) Eklund, A. C.; Szallasi, Z. Correction of technical bias in clinical microarray data improves concordance with known biological information. Genome Biol. 2008, 9, R26.
(10) LeBoeuf, R. A.; Kerckert, G. A.; Aardema, M. J.; Gibson, D. P.; Brauninger, R.; Isfort, R. J. The pH 6.7 Syrian hamster embryo cell transformation assay for assessing the carcinogenic potential of chemicals. Mutat. Res. 1996, 356, 85–127.
(11) Walsh, M. J.; Bruce, S. W.; Pant, K.; Carmichael, P. L.; Scott, A. D.; Martin, F. L. Discrimination of a transformation phenotype in Syrian golden hamster embryo (SHE) cells using ATR-FTIR spectroscopy. Toxicology 2009, 258, 33–38.
(12) Haka, A. S.; Volynskaya, Z.; Gardecki, J. A.; Nazemi, J.; Lyons, J.; Hicks, D.; Fitzmaurice, M.; Dasari, R. R.; Crowe, J. P.; Feld, M. S. In vivo margin assessment during partial mastectomy breast surgery using Raman spectroscopy. Cancer Res. 2006, 66, 3317–3322.
(13) Beekes, M.; Lasch, P.; Naumann, D. Analytical applications of Fourier transform-infrared (FT-IR) spectroscopy in microbiology and prion research. Vet. Microbiol. 2007, 123, 305–319.
(14) Levin, I. W.; Bhargava, R. Fourier transform infrared vibrational spectroscopic imaging: Integrating Microscopy and Molecular Recognition. Annu. Rev. Phys. Chem. 2005, 56, 429–474.
(15) Martin, F. L.; Pollock, H. M. In Oxford Handbook of Nanoscience and Technology; Narlikar, A. V., Fu, Y. Y., Eds.; Oxford University Press: UK, 2010; Vol. 2, Chapter 8, pp 285-336.
(16) Kazarian, S. G.; Chan, K. L. Applications of ATR-FTIR spectroscopic imaging to biomedical samples. Biochim. Biophys. Acta 2006, 1758, 858–865.
(17) Baena, J. R.; Lendl, B. Raman spectroscopy in chemical bioanalysis. Curr. Opin. Chem. Biol. 2004, 8, 534–539.
(18) Matousek, P.; Morris, M. D.; Everall, N.; Clark, I. P.; Towrie, M.; Draper, E.; Goodship, A.; Parker, A. W. Numerical simulations of subsurface probing in diffusely scattering media using spatially offset Raman spectroscopy. Appl. Spectrosc. 2005, 59, 1485–1492.
(19) Jain, A.; Duin, P. Statistical pattern recognition: a review. IEEE T. Pattern Anal. 2000, 22, 4–37.
(20) Bassan, P.; Byrne, H. J.; Bonnier, F.; Lee, J.; Dumas, P.; Gardner, P. Resonant Mie scattering in infrared spectroscopy of biological materials - understanding the ‘dispersion artefact’. Analyst 2009, 1, 1586–1593.
(21) Bassan, P.; Kohler, A.; Martens, H.; Lee, J.; Jackson, E.; Lockyer, N.; Dumas, P.; Brown, M.; Clarke, N.; Gardner, P. RMieS-EMSC correction for infrared spectra and GPU computing. J. Biophotonics 2010, 3, 609–20.
(22) Somorjai, R.; Dolenko, B.; Baumgartner, R. Mapping high-dimensional data onto a relative distance plane--an exact method for visualizing and characterizing high-dimensional patterns. Bioinformatics 2003, 19, 1484–1491.
(23) Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
(24) Cristianini, N.; Shawe-Taylor, J. An introduction to support vector machines; Cambridge University Press: New York, 2004.
(25) Ni, W.; Brown, S. D.; Man, R. Wavelet orthogonal signal correction-based discriminant analysis. Anal. Chem. 2009, 81, 8962–8967.
(26) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, 2nd ed.; Springer: New York, 2010.
(27) Kohavi, R.; John, G. H. Wrappers for feature subset selection. Art. Intell. 1997, 97, 273–324.
(28) Polat, K.; Güneş, S. Principles component analysis, fuzzy weighting pre-processing and artificial immune recognition system based diagnostic system for diagnosis of lung cancer. Expert Syst. Appl. 2008, 34, 214–221.
(29) Korenius, T.; Laurikkala, J.; Juhola, M. On principal component analysis, cosine and Euclidean measures in information retrieval. Inform. Sci. 2007, 177, 4893–4905.
(30) Kemsley, E. K. Discriminant analysis of high-dimensional data: a comparison of principal components analysis and partial least squares data reduction methods. Chemom. Intell. Lab. Syst. 1996, 33, 47–61.
(31) Fearn, T. Handbook of Vibrational Spectroscopy; Chalmers, J. M., Griffiths, P. R., Eds.; Wiley Publications: Chichester, 2002.
(32) German, M. J.; Hammiche, A.; Ragavan, N.; Tobin, M. J.; Cooper, L. J.; Matanhelia, S. S.; Hindley, A. C.; Nicholson, C. M.; Fullwood, N. J.; Pollock, H. M.; Martin, F. L. Infrared Spectroscopy with Multivariate Analysis Potentially Facilitates the Segregation of Different Types of Prostate Cell. Biophys. J. 2006, 90, 3783–3795.
(33) Martin, F. L.; German, M. J.; Wit, E.; Fearn, T.; Ragavan, N.; Pollock, H. M. Identifying variables responsible for clustering in discriminant analysis of data from infrared microspectroscopy of a biological sample. J. Comput. Biol. 2007, 14, 1176–1184.
(34) Llabjani, V.; Jones, K. C.; Thomas, G. O.; Walker, L. A.; Shore, R. F.; Martin, F. L. Polybrominated diphenyl ether-associated alterations in cell biochemistry as determined by attenuated total reflection Fourier-transform infrared spectroscopy: a comparison with DNA-reactive and/or endocrine disrupting agents. Environ. Sci. Technol. 2009, 43, 3356–3364.
(35) Llabjani, V.; Trevisan, J.; Jones, K. C.; Shore, R. F.; Martin, F. L. Binary mixture effects by PBDE and PCB congeners (126 or 153) in MCF-7 cells: biochemical alterations assessed by IR spectroscopy and multivariate analysis. Environ. Sci. Technol. 2010, 44, 3992–3998.
(36) Wang, X. Y.; Garibaldi, J. M.; Bird, B.; George, M. In Advances in Fuzzy Clustering and its Applications; Oliveira, J. V., Pedrycz, W., Eds.; Wiley: Chichester, 2007; pp 405-425.
(37) Peña, J. M.; Lozano, J. A.; Larrañaga, P. An empirical comparison of four initialisation methods for the K-Means algorithm. Pattern Recogn. Lett. 1999, 20, 1027–1040.
(38) Lasch, P.; Haensch, W.; Naumann, D.; Diem, M. Imaging of colorectal adenocarcinoma using FT-IR microspectroscopy and cluster analysis. Biochim. Biophys. Acta 2004, 1688, 176–186.
(39) Duda, R. O.; Hart, P. E.; Stork, D. G. Pattern Classification, 2nd ed.; John Wiley & Sons: New York, 2001.
(40) Iglesias, J. A.; Angelov, P.; Ledezma, A.; Sanchis, A. Modelling Evolving User Behaviours. IEEE Symposium Series on Computational Intelligence, Nashville, TN, USA; IEEE Press, 2009; ISBN 978-1-4244-2754-3, pp 16-23.
(41) Angelov, P. Machine Learning (Collaborative Systems) patent (WO2008053161, priority date: 1 November 2006; international filing date 23 October 2007).
(42) Angelov, P.; Zhou, X. Evolving Fuzzy-Rule-Based Classifiers From Data Streams. IEEE Trans. Fuzzy Syst. 2008, 16, 1462–1475.
(43) Angelov, P.; Filev, D. An approach to online identification of Takagi-Sugeno fuzzy models. IEEE Trans. Syst., Man Cybern., part B 2004, 34, 484–498.
(44) Petibois, C.; Deleris, G. Chemical mapping of tumor progression by FT-IR imaging: towards molecular histopathology. Trends Biotechnol. 2006, 24, 455–62.
(45) Walsh, M. J.; Fellous, T. G.; Hammiche, A.; Lin, W. R.; Fullwood, N. J.; Grude, O.; Bahrami, F.; Nicholson, J. M.; Cotte, M.; Susini, J.; Pollock, H. M.; Brittan, M.; Martin-Hirsch, P. L.; Alison, M. R.; Martin, F. L. Fourier transform infrared microspectroscopy identifies symmetric PO2- modifications as a marker of the putative stem cell region of human intestinal crypts. Stem Cells 2008, 26, 108–118.
(46) Walsh, M. J.; Hammiche, A.; Fellous, T. G.; Nicholson, J. M.; Cotte, M.; Susini, J.; Fullwood, N. J.; Martin-Hirsch, P. L.; Alison, M. R.; Martin, F. L. Tracking the cell hierarchy in the human intestine using biochemical signatures derived by mid-infrared microspectroscopy. Stem Cell Res. 2009, 3, 15–27.
(47) Walsh, M. J.; German, M. J.; Singh, M.; Pollock, H. M.; Hammiche, A.; Kyrgiou, M.; Stringfellow, H. F.; Paraskevaidis, E.; Martin-Hirsch, P. L.; Martin, F. L. IR microspectroscopy: potential applications in cervical cancer screening. Cancer Lett. 2007, 246, 1–11.
(48) Walsh, M. J.; Singh, M. N.; Pollock, H. M.; Cooper, L. J.; German, M. J.; Stringfellow, H. F.; Fullwood, N. J.; Paraskevaidis, E.; Martin-Hirsch, P. L.; Martin, F. L. ATR microspectroscopy with multivariate analysis segregates grades of exfoliative cervical cytology. Biochem. Biophys. Res. Commun. 2007, 352, 213–219.
(49) Walsh, M. J.; Singh, M. N.; Stringfellow, H. F.; Pollock, H. M.; Hammiche, A.; Grude, O.; Fullwood, N. J.; Pitt, M. A.; Martin-Hirsch, P. L.; Martin, F. L. FTIR microspectroscopy coupled with two-class discrimination segregates markers responsible for inter- and intracategory variance in exfoliative cervical cytology. Biomark. Insights 2008, 3, 179–189.
(50) Kelly, J. G.; Angelov, P.; Walsh, M. J.; Pollock, H. M.; Pitt, M. A.; Martin-Hirsch, P. L.; Martin, F. L. Intelligent interrogation of mid-IR spectroscopy data from exfoliative cervical cytology using self-learning classifier eClass. Int. J. Comput. Intell. Res. 2008, 4, 392–401.
(51) Bentley, A. J.; Nakamura, T.; Hammiche, A.; Pollock, H. M.; Martin, F. L.; Kinoshita, S.; Fullwood, N. J. Characterization of human corneal stem cells by synchrotron infrared micro-spectroscopy. Mol. Vis. 2007, 13, 237–242.
(52) Nakamura, T.; Kelly, J. G.; Trevisan, J.; Cooper, L. J.; Bentley, A. J.; Carmichael, P. L.; Scott, A. D.; Cotte, M.; Susini, J.; Martin-Hirsch, P. L.; Kinoshita, S.; Fullwood, N. J.; Martin, F. L. Microspectroscopy of spectral biomarkers associated with human corneal stem cells. Mol. Vis. 2010, 16, 359–68.
(53) Kelly, J. G.; Singh, M. N.; Stringfellow, H. F.; Walsh, M. J.; Nicholson, J. M.; Bahrami, F.; Ashton, K. M.; Pitt, M. A.; Martin-Hirsch, P. L.; Martin, F. L. Derivation of a subtype-specific biochemical signature of endometrial carcinoma using synchrotron-based Fourier-transform infrared microspectroscopy. Cancer Lett. 2009, 274, 208–217.
(54) Winder, C. L.; Goodacre, R. Comparison of diffuse-reflectance absorbance and attenuated total reflectance FT-IR for the discrimination of bacteria. Analyst 2004, 129, 1118–1122.
(55) Zwielly, A.; Gopas, J.; Brkic, G.; Mordechai, S. Discrimination between drug-resistant and non-resistant human melanoma cell lines by FTIR spectroscopy. Analyst 2009, 134, 294–300.

REVIEWS

(56) Wood, B. R.; Chernenko, T.; Matth€aus, C.; Diem, M.; Chong, C.; Bernhard, U.; Jene, C.; Brandli, A. A.; McNaughton, D.; Tobin, M. J.; Trounson, A.; Lacham-Kaplan, O. Shedding new light on the molecular architecture of oocytes using a combination of synchrotron Fourier transform-infrared and raman spectroscopic mapping. Anal. Chem. 2008, 80, 9065–9072. (57) Martin, F. L.; Kelly, J. G.; Llabjani, V.; Martin-Hirsch, P. L.; Patel, I. I.; Trevisan, J.; Fullwood, N. J.; Walsh, M. J. Distinguishing cell types or populations based on the computational analyses of their infrared spectra. Nat. Protoc. 2010, 5, 1748–1760. (58) Zhang, G.; Moore, D. J.; Flach, C. R.; Mendelsohn, R. Vibrational microscopy and imaging of skin: from single cells to intact tissue. Anal. Bioanal. Chem. 2007, 387, 1591–1599. (59) Ly, E.; Piot, O.; Durlach, A.; Bernard, P.; Manfait, M. Differential diagnosis of cutaneous carcinomas by infrared spectral microimaging combined with pattern recognition. Analyst 2009, 134, 1208– 1214. (60) Krafft, C.; Sobottka, S. B.; Geiger, K. D.; Schackert, G.; Salzer, R. Classification of malignant gliomas by infrared spectroscopic imaging and linear discriminant analysis. Anal. Bioanal. Chem. 2007, 387, 1669– 1677. (61) Shetty, G.; Kendall, C.; Shepherd, N.; Stone, N.; Barr, H. Raman spectroscopy: elucidation of biochemical changes in carcinogenesis of oesophagus. Br. J. Cancer 2006, 94, 1460–1464. (62) Yao, H.; Tao, Z.; Ai, M.; Peng, L.; Wang, G.; He, B.; Li, Y. Vibrational spectroscopy Raman spectroscopic analysis of apoptosis of single human gastric cancer cells. Vib. Spectrosc. 2008, 50, 193– 197. (63) Hammiche, A.; German, M. J.; Hewitt, R.; Pollock, H. M.; Martin, F. L. Monitoring cell cycle distributions in MCF-7 cells using near-field photothermal microspectroscopy. Biophys. J. 2005, 88, 3699– 3706. (64) Grude, O.; Hammiche, A.; Pollock, H.; Bentley, A. J.; Walsh, M. J.; Martin, F. L.; Fullwood, N. J. 
Near-field photothermal microspectroscopy for adult stem-cell identification and characterization. J. Microscopy 2007, 228, 366–372. (65) Lasch, P.; Diem, M.; H€ansch, W.; Naumann, D. Artificial neural networks as supervised techniques for FT-IR microspectroscopic imaging. J. Chemometrics 2007, 20, 209–220. (66) Bhargava, R.; Fernandez, D. C.; Hewitt, S. M.; Levin, I. W. High throughput assessment of cells and tissues: Bayesian classification of spectral metrics from infrared vibrational spectroscopic imaging data. Cancer 2006, 1758, 830–845. (67) Mohlenhoff, B.; Romeo, M.; Diem, M.; Wood, B. R. Mie-type scattering and non-Beer-Lambert absorption behaviour of human cells in infrared microspectroscopy. Biophys. J. 2005, 88, 3635–3640. (68) Kuimova, M. K.; Chan, K. L.; Kazarian, S. G. Chemical imaging of live cancer cells in the natural aqueous environment. Appl. Spectrosc. 2009, 63, 164–171. (69) Chen, Y.; Cheng, Y.; Liu, H.; Lin, P.; Wang, C. Observation of biochemical imaging changes in human pancreatic cancer tissue using Fourier-transform infrared microspectroscopy. Chang Gung Med. J. 2005, 29, 518–527. (70) Untereiner, V.; Piot, O.; Diebold, M. D.; Bouche, O.; Scaglia, E.; Manfait, M. Optical diagnosis of peritoneal metastases by infrared microscopic imaging. Anal. Bioanal. Chem. 2009, 393, 1619–1627. (71) Thumanu, K.; Tanthanuch, W.; Lorthingpanich, C.; Heraud, P.; Parnpai, R. FTIR microspectroscopic imaging as a new tool to distinguish chemical composition of mouse blastocyst. J. Mol. Struct. 2009, 933, 104–111. (72) Draux, F.; Gobinet, C.; Sule-Suso, J.; Trussardi, A.; Manfait, M.; Jeannesson, P.; Sockalingum, G. D. Raman spectral imaging of single cancer cells: probing the impact of sample fixation methods. Anal. Bioanal. Chem. 2010, 397, 2727–2737. (73) Griebe, M.; Daffertshofer, M.; Stroick, M.; Syren, M.; AhmadNejad, P.; Neumaier, M.; Backhaus, J.; Hennerici, M. G.; Fatar, M. Infrared spectroscopy: a new diagnostic tool in Alzheimer disease. Neurosci. 
Lett. 2007, 420, 29–33. 1447

dx.doi.org/10.1021/pr101067u |J. Proteome Res. 2011, 10, 1437–1448

Journal of Proteome Research

REVIEWS

(74) Bird, B.; Miljkovic, M.; Romeo, M. J.; Smith, J.; Stone, N.; George, M. W.; Diem, M. Infrared micro-spectral imaging: distinction of tissue types in axillary lymph node histology. BMC Clin. Path. 2008, 8, 8. (75) Widjaja, E.; Zheng, W.; Huang, Z. Classification of colonic tissues using near-infrared Raman spectroscopy and support vector machines. Int. J. Oncol. 2008, 32, 653–662. (76) Njoroge, E.; Alty, S. R.; Gani, M. R.; Alkatib, M. Classification of cervical cancer cells using FTIR data. Conf. Proc. IEEE Eng. Med. Biol. Soc. 2006, 1, 5338–5341.
