Prediction and Classification of Drug Toxicity Using ... - ACS Publications

May 22, 2007 - Elaine Holmes, and Jeremy K. Nicholson* ... Metabonomic Toxicology (COMET) project, a substantial metabolic and pathological database ...
1 downloads 0 Views 468KB Size
Prediction and Classification of Drug Toxicity Using Probabilistic Modeling of Temporal Metabolic Data: The Consortium on Metabonomic Toxicology Screening Approach Timothy M. D. Ebbels,* Hector C. Keun, Olaf P. Beckonert, Mary E. Bollard, John C. Lindon, Elaine Holmes, and Jeremy K. Nicholson* Department of Biomolecular Medicine, Division of Surgery, Oncology, Reproductive Biology and Anaesthetics, Faculty of Medicine, Imperial College London, Sir Alexander Fleming Building, South Kensington, London SW7 2AZ, United Kingdom Received May 22, 2007

Detection and classification of in vivo drug toxicity is an expensive and time-consuming process. Metabolic profiling is becoming a key enabling tool in this area as it provides a unique perspective on the characterization and mechanisms of response to toxic insult. As part of the Consortium on Metabonomic Toxicology (COMET) project, a substantial metabolic and pathological database was constructed. We chose a set of 80 treatments to build a modeling system for toxicity prediction using NMR spectroscopy of urine samples (n ) 12935) from laboratory rats (n ) 1652). The compound structures and activities were diverse but there was an emphasis on the selection of hepato and nephrotoxins. We developed a two-stage strategy based on the assumptions that (a) adverse effects would produce metabolic profiles deviating from those of normal animals and (b) such deviations would be similar for treatments having similar physiological effects. To address the first stage, we developed a multivariate model of normal urine, using principal components analysis of specially preprocessed 1H NMR spectra. The model demonstrated a high correspondence between the occurrence of toxicity and abnormal metabolic profiles. In the second stage, we extended a density estimation method, “CLOUDS”, to compute multidimensional similarities between treatments. Crucially, the technique allowed a distribution-free estimate of similarity across multiple animals and time points for each treatment and the resulting matrix of similarities showed segregation between liver toxins and other treatments. Using the similarity matrix, we were able to correctly identify the target organ of two “blind” treatments, even at sub-toxic levels. To further validate the approach, we then applied a leave-one-out approach to predict the main organ of toxicity (liver or kidney) showing significant responses using the three most similar matches in the matrix. Where predictions could be made, there was an error rate of 8%. The sensitivities to liver and kidney toxicity were 67 and 41%, respectively, whereas the corresponding specificities were 77 and 100%. In some cases, it was not possible to make predictions because of interference by drug-related metabolite signals (18%), an inconsistent histopathological or urinary response (11%), genuine class overlap (8%), or lack of similarity to any other treatment (2%). This study constitutes the largest validation to date of the metabonomic approach to preclinical toxicology assessment, confirming that the methodology offers practical utility for rapid in vivo drug toxicity screening. Keywords: metabonomics • toxicity prediction • 1H NMR • metabolic profile • CLOUDS • modeling • COMET project • biofluids • urine • rats • toxicology • screening • toxin likeness

Introduction Few of the millions of drug candidates synthesized each year reach the clinic. The loss of compounds from the discovery/ development pipeline is known as attrition and apart from the economic losses, inefficiencies in screening can lead to can* To whom correspondence should be addressed. E-mail: t.ebbels@ imperial.ac.uk, [email protected]. Telephone, +44 (20) 7594 3160; Fax, +44 (20) 7594 3226. 10.1021/pr0703021 CCC: $37.00

 2007 American Chemical Society

didate drug failure later in development or even after the product release stage, thus posing a safety hazard.1 Of those having appropriate pharmacological activity, most fail at the preclinical stage for safety reasons. It is therefore extremely important to efficiently screen compounds for safety and in vivo rodent models are standard tools for this investigation. Conventional methods of safety assessment include histopathology and clinical chemistry, but these are costly and timeconsuming, monitoring a limited number of parameters at few Journal of Proteome Research 2007, 6, 4407-4422

4407

Published on Web 10/04/2007

research articles time points. However, in recent years, new techniques have been brought to bear on the problem, particularly the “omics” technologies, based on the analysis of gene transcription, protein and metabolic profiles. The metabonomic approach to screening2,3 holds great potential in this field. In this methodology, animals are subjected to treatments generating typical toxicological lesions and the metabolite composition of their biofluids analyzed by spectroscopic or spectrometric methods. Nuclear magnetic resonance (NMR) spectroscopy has been successfully used in this area for many years4-7 and has recently been complemented by application of liquid chromatography-mass spectrometry.8 The resulting spectra (NMR or LC-MS) provide fingerprints of the metabolic state of the organism under toxicological stress2,9 and this can be used to predict the likely toxicological profile of new compounds by comparison with a sufficiently large database of similar spectra. This approach generates new challenges for the bioinformatic and statistical pattern recognition techniques used to process and model such data. The use of any “omic” profile to model toxicity presents several difficulties, which must be addressed by a comprehensive and integrated data treatment strategy. For example, individual animals and even entire cohorts can respond with different speeds and magnitude to identical treatments, even when the gross toxicity is similar. Indeed, such variation is the basis of pharmaco-metabonomics studies.10 Specialized techniques have been developed which successfully take account of these variations allowing analyses to focus on common patterns.11,12 For urine, another source of variation is volume, which leads to changes in the overall signal level from all metabolites in the mixture. This necessitates a normalization step in data preprocessing.13 In toxicology studies, strong signals from the administered xenobiotics and their metabolites are often seen in urinary profiles (here termed drug related compounds, DRCs). These interferences, which can occur in NMR or MS based studies, can obscure important endogenous signals and must be carefully removed in order that the resulting classifiers recognize signs of toxicity rather than patterns specific to an individual treatment.14 This effect results in a minority of treatments which cannot be classified with a sufficient degree of confidence. Perhaps most importantly, it is difficult to calibrate models based on molecular profiles (in this case derived from urinary metabonomics) with traditional methods such as histopathology. In contrast to the continuous, individual-specific time course of response obtainable with urine sampling, histopathology provides a single “snapshot” of tissue status which cannot be indicative of the full evolution of the toxic lesion. Indeed, using standard toxicological sampling, it is possible to miss a lesion entirely if the sampling time points are inappropriate. By definition, novel therapeutic agents have unknown toxicity time profiles and the selection of testing time points is effectively arbitrary. Hence, methods that utilize multiple time points (such as noninvasive assays) are less likely to miss metabolic fluctuations that indicate the specific location of functional lesions. Further, in common with most “omic” approaches, metabolic profiles report on molecular mechanisms rather than the macroscopic tissue pathology, which is the result of a cascade of post mechanism failures. Despite these difficulties, in this work we have used histopathology to delineate the classical toxicity descriptor of each treatment, because of its widely accepted use in assessing preclinical toxicity and human pathology. Nonetheless, this strategy can still result in a proportion of treatments which are unclassifiable 4408

Journal of Proteome Research • Vol. 6, No. 11, 2007

Ebbels et al.

because the lesion observable by histopathology in a particular study is not that expected from the literature, is ambiguous or is not synchronized in time or magnitude with the biochemical effect observed by urinary metabonomics. Overall, for classification models to be relevant in real preclinical testing regimes, a large dataset with a wide range of toxicities must be used. Any given model will concentrate on a subset of toxicities (perhaps by limiting to particular sites or mechanisms). Therefore, it is important to include examples of treatments with effects outside this subset in the test set, to judge the real world performance of the prediction system, and this is the approach we take in this paper. Various approaches have been applied to classification of metabonomic data according to toxicity type, including multilayer perceptrons,16 soft independent modeling of class analogy (SIMCA),17 and probabilistic neural networks.18 Low error rates have been achieved with relatively modest numbers of toxins and samples. The COnsortium for MEtabonomic Toxicology (COMET)19 was set up to build on these earlier efforts, assembling a large database of 1H NMR spectra from studies of model toxins and drugs. Inter-laboratory replicate studies have shown excellent analytical reproducibility,20 allowing the detection and characterization of diverse episodes of toxicity.21-24 Indeed, NMR based metabonomics is exceptionally reproducible, as shown in human population studies.25 Drugs and toxins can produce pharmacological and physiological effects as well as overt toxicity and an important part of the COMET project involved collection of data on commonly observed physiological and behavioral effects of treatments. The full listing of all treatments used in this study is given in Table 2. Indeed, it is known that some treatments cause temporary anorexia and food deprivation and pair matched feeding studies15 have shown that significant metabonomic effects of temporary food restriction can be offset by pair matched feeding experiments. We chose not to use pair feeding in the COMET studies for both scientific and practical reasons. First, a temporary food aversion is part of the global response of an organism to some types of toxicity, but the pattern of modulation of dietary restricted compounds is still dictated by toxin-influenced pathway alterations, such that two treatments both causing anorexic effects will still have distinct metabolic characteristics. This is evidenced by the fact that hepatotoxins of various types all have characteristic (toxin specific) pattern modulations as well as organ specific signatures.7 Furthermore, in the real world, pharmaceutical companies do not and can never do pair feeding experiments in routine toxicological screening because of the increased resource requirements. Hence, any system built using pair feeding models would not be practical in the screening field. The final goal of the COMET project was to build a system that not only identified treatments with toxic effects, but also suggested the likely type of toxicity, thereby assisting in the elucidation of toxic mechanisms. The COMET project has highlighted the ability of metabonomic approaches to follow each individual animal noninvasively throughout a whole time course. It has also underlined the importance of the geometry of the metabolic trajectory as a characteristic of different types of toxicity.11 Indeed, the time-resolved nature of many metabonomics studies has generated renewed interest in models capable of taking underlying factors such as time into account.18,26,27 Combining the data across all time points delineates the total group response which may then be compared as a unit with the signatures of other treatments. In this way,

research articles

Toxicity Prediction with Temporal Metabolic Data

interfering signals) are discarded and the toxicity profile of new treatments determined via their toxin likeness to those already in the database. We employed a set of 80 treatments from the COMET database to build and validate the metabonomic screening method. Performance was tested both by application to two blind studies and a leave-one-out cross validation approach; the latter was assessed for significance by a permutation test. The COMET screening method described here was delivered and practically implemented by several major pharmaceutical companies on completion of the project. The work constitutes the largest validation to date of the metabonomic approach in this area and the results indicate that the approach is successful and realistic as a screening tool for preclinical toxicology.

Methods

Figure 1. Schematic flow diagram of the construction of the COMET screening method. Numbers of samples and treatments are indicated at different stages of the process.

a similarity measure may be developed describing the “toxin likeness” of one compound to another. This idea has some similarities to the “drug likeness” concept,28 except that we compare in vivo metabolic trajectories rather than physicochemical properties of drugs. The method employed to compute such values of similarity must be one which can deal with multidimensional timeordered trajectories which are characteristic of the sequence of metabolic effects that accompany the development and regression of lesions. This is particularly important for treatments that act on multiple systems, pathways, or organs. Here, we extend the CLassification Of Unknowns by Density Superposition (CLOUDS) model23 to construct a measure of similarity between multidimensional probability densities representing the metabolic response to each treatment. The new model represents a step forward in our ability to capture and compare the characteristics of collections of metabolic trajectories. This similarity value can be used as an estimate of the toxin likeness between treatments, allowing the response to new treatments to be explained in terms of those already in the database. Figure 1 illustrates the construction of the “expert system” providing the metabonomic toxicity screen. After in-life studies and NMR spectroscopy of the resulting urine samples, the spectra are preprocessed and we construct a model which summarizes the urinary variation of physiologically normal animals. Multivariate metabolic trajectories from animals determined to be abnormal by this model are then suitably scaled and compared via the CLOUDS methodology. Unreliable similarities (due to

Animal Studies. The study protocol was designed to simulate as far as possible the experimental design used by drug companies in in vivo single dose “7 day” studies. In all cases, a broad range clinical chemistry screen and histopathological assessment was obtained for each study and each treatment was assessed separately, and summary effects reported prior to database construction. Eighty treatments covering a wide range of toxicities and physiological stressors were selected to explore a wide range of the metabolic parameter space, but with strong representation from kidney and liver toxicants because of their relative importance in drug failure as well as their diverse mechanisms of action.19,29 Full details of the treatments are listed in Table 2. In addition, two surgical models of organ damage and two studies of general metabolic stress (food and water restriction) were conducted. Four toxins were also selected whose primary toxicity affects other organs (pancreas and testes). Several of the treatments had complex effects that involved multiple organs and/or temporal differentiation between their effects; as discussed above, these were retained in the system in order to test the “real world” performance of the classifier. Studies were conducted in male Sprague-Dawley rats at seven sites (six in-house pharmaceutical company locations and a contract house) using a standardized protocol. For each study, 20 rats were randomly and equally assigned to two treatment groups (control and dosed). Each dosed group received a single acute administration of the compound, whereas each control group received the dosing vehicle only. Urine samples were collected at 0-8 and 8-24 h on the day prior to treatment, and further samples collected at 8, 24, 48, 72, 96, 120, 144, and 168 h (7 days) post-treatment. Each treatment group was split into two subgroups, one euthanased at 48 h post-dose and the other euthanased at 168 h post-dose for histopathological and clinical chemistry evaluation. Seven treatments, of necessity, had to deviate from the standard protocol, either because multiple doses had to be administered to induce the required pathology (acetaminophen), to maintain physiological disturbance (ammonium chloride, sodium bicarbonate), or because different numbers of animals were used (food and water restrictions). Several hydrazine dosing studies were performed as a measure of interlaboratory variability. Six of these were identical, one used a variation on the group numbers, and one was performed in another rat strain (Han-Wistar) for comparison purposes. All animal studies were carried out in accordance with national legislation and were subject to appropriate local review. In total, 12 935 urine samples (6260 control and 6675 dosed) were collected from 1652 animals (800 control and 852 dosed). Journal of Proteome Research • Vol. 6, No. 11, 2007 4409

research articles

Ebbels et al.

Figure 2. Generation and use of the normal model. (a) Typical 1H NMR spectrum of rat urine and (b) reduced spectrum. (c) Scores on the first 3 PCs of the normal model colored by treatment. (d) A schematic of the augmented residuals (including contributions from both PC scores and residual variation) of 3 dosed animals versus time. The dashed line indicates the threshold for samples to be considered “abnormal”. It is clear that, depending on individual responses, only some time points of each treatment will have more than 50% of samples classified as “abnormal”.

NMR Spectroscopy. All urine samples were buffered in sodium phosphate and combined with an internal reference (TSP) and anti-bacterial agent (sodium azide). 400 µL of each urine sample was added to 200 µL of sodium phosphate buffer (0.2 M Na2HPO4, 0.04 M NaH2PO4 in 80:20 H2O:D2O, pH 7.4) containing 1 mM TSP (sodium trimethylsilyl [2,2,3,3-2H4] propionate) and 9 mM sodium azide. All samples were centrifuged at ∼1800 g for 5 min to remove any solid debris. 1H NMR spectra were acquired at 600 MHz and 300 K using a robotic flow-injection system (Bruker Biospin, Karlsruhe, Germany) with suppression of the water resonance using a standard presaturation pulse sequence (90°-3µs-90°-mixing90°-acquire) with irradiation during a 2 s relaxation delay and also during the 100 ms mixing time. Spectral acquisition time was approximately 4 min per sample. Sixty-four transients were collected into 64 K data points using a spectral width of 20 ppm (12 kHz). Data Processing. Prior to Fourier transformation, an exponential line-broadening factor of 1 Hz was applied to each free induction decay. The resulting spectra were Fourier transformed using XWinNMR (Bruker Biospin, Karlsruhe, Germany) and phased, baseline corrected and referenced automatically using an in-house routine written in MATLAB (version 6.5, The MathWorks, Natick, Massachusetts). MATLAB was used for all subsequent procedures. For statistical analysis, each spectrum 4410

Journal of Proteome Research • Vol. 6, No. 11, 2007

was segmented into M ) 205 variables by integrating the signal in regions of equal width (0.04 ppm) in the chemical shift ranges δ0.20-4.50 and δ5.98-10.0, excluding spectral artifacts in the region δ4.50-5.98 (see Figure 2a, b). This segmentation procedure is well established as a means of smoothing minor fluctuations in chemical shifts due to pH and other ionic variations.30,31 Given the massive data sets used here, it is also much more computationally efficient to use the segmentation approach, in contrast to recent studies that analyze the spectra at their native high resolution.32 (Such high-resolution studies are not precluded because the primary database holds the spectra in their high resolution format.) All thus reduced spectra were then normalized to a constant integrated intensity of 100 units to take account of large variations in urine concentration. In 979 spectra, significant resonances from non-endogenous compounds (usually drug related compounds, DRCs) were observed, amounting to 4.5% of the total data matrix. In order to allow modeling to concentrate solely on signals from endogenous metabolites, these regions must be removed or replaced. Simple removal would entail deleting 162/205 (79%) of the spectral regions across all spectra, an unacceptably high figure. For this reason a “spectral replacement” protocol was developed,14 applied to the regions affected by DRCs and these regions then treated as missing values in subsequent analyses. The protocol was designed to obtain the correct normalization

research articles

Toxicity Prediction with Temporal Metabolic Data

of the remaining endogenous signals and operates by replacing the regions containing the nonendogenous signals with a suitably scaled segment from the mean control spectrum. In this way, the effect of large xenobiotic signals on the rest of the spectral profile is reduced. Multivariate Model of Urinary Profiles from “Physiologically Normal” Control Animals. Metabolic normality in any study is defined by the control population; this principle applies equally to experimental animal and human populations. To identify treatment responses and to assist in quality control, a normal model was constructed using spectra from the control group. A conventional approach to characterizing normal variation in metabolism studies is to define normal ranges for each metabolite observed. However, this ignores the presence of inter-metabolite correlations, due to homeostatic control processes, which are a strong feature of all metabolic profiles. We therefore opted for a multivariate approach using principal components analysis (PCA) implemented in SIMCA-P (version 9, Umetrics AB, Umea, Sweden). PCA allowed us to detect departures from normality in both metabolite levels and their correlation structure. In constructing such a normal model, we were mindful that some control animals consistently showed abnormal behavior (due to, for example, undiagnosed disease conditions, stress, etc.). We therefore aimed to construct a model which summarized the majority of control animal variation, whereas excluding some obviously abnormal animals. We also required that our model be robust to small data perturbations, indicating that the homeostatic processes captured were common to a large proportion of the population. To construct the model, we employed an iterative algorithm. First, using all samples from control animals, we built a PCA model, determining the number of components by 7-fold crossvalidation. We then excluded samples with a low probability of model membership (p < 0.05) and constructed a new PCA model with the new training set. This was repeated until the cross-validation predicted variance (Q2) reached a maximum, ensuring robustness to perturbations of the sample set. The training set and model with the highest Q2 was selected as the fiducial normal model. In the above algorithm, the cross-validation predicted variance Q2 and probability of model membership p were determined as follows. For each model, the predicted residual error sum of squares (PRESS) is calculated as PRESS )

∑ |xˆ - x | i

2

i

i

where xˆ i denotes the predicted vector (reduced NMR spectrum) for sample i when it is left out in the cross validation procedure. The cross-validation predicted variance is then defined as Q2 ) 1 -

PRESS SS

where SS is the total sum of squares of the centered data. Q2 is a measure of the robustness of a PCA model to sub-sampling of the data and was used to choose the number of components, A, for each model. The probability of model membership p corresponds to the PmodXPS+ parameter in SIMCA-P.25 This is calculated from the augmented residuals (DModXPS+), defined as: DModXPS+ )

x

d2⊥ + d2|| K-A

where K is the number of variables in the data, d⊥ is the standard residual (the perpendicular distance to the model hyperplane) and d|| is the distance from the predicted data point to the model boundary within the model hyperplane. PmodXPS+ is computed from an F test comparing the residual DmodXPS+ and the pooled residual, s0, for the model:

∑ DModXPS+ s 20 )

2 i

i

(N - A - 1)

See ref 25 for more details. The resulting model, based on 4023 control spectra, had 5 components which described 84% of the total variance, with Q2 ) 61% of the variance predictable according to cross-validation. Selection and Scaling of Dosed Samples. Using the normal model, samples from dosed animals were deemed “abnormal” if the augmented residual (DmodXPS+ in SIMCA-P) fell above the 99th percentile of the control training set. They were selected for use in the expert system if (a) more than 50% of the samples from the same treatment and time point were abnormal (see Figure 2d), and (b) the treatment showed significant toxicity by histopathology. The first criterion was applied to ensure a consistent response across animals. This procedure resulted in 2056 abnormal samples from 597 dosed animals subject to 62 treatments, designated as the dosed training set. Note that without the 50% filter, the number of abnormal dosed samples would increase slightly to 2305. The reduced spectra for these samples were then scaled using the scaled to maximum aligned reduced trajectories (SMART) procedure11 to align animals with different basal metabolic profiles and compensate for differing magnitudes of response. The procedure operates by subtracting predose spectra from all others, and then scaling all spectra such that the difference between the means of the predose and time point of maximum response is a constant. This was followed by variable stability (VAST) scaling12 to reduce the effects of hyper-variation in some metabolite levels. Initially, all variables are scaled to unit variance to give them an equal footing. They are then scaled by the inverse of the coefficient of variation, i.e., multiplying by mean/standard deviation so that those variables with the most variation relative to their mean are down-weighted relative to others which are more stable. In this study, the VAST weights were computed using only the control population, so that variation related to the treatment was retained and only naturally hypervariable metabolites were down-weighted. The resulting 2056 × 205 data matrix was then used to build a CLOUDS model23 where the data for all time points showing an effect according to the above criteria were combined to give the probability cloud for each treatment. The model produced a 62 × 62 similarity matrix in which each element estimated the multidimensional similarity between a pair of treatments. CLOUDS Modeling. The CLOUDS model is based on the Parzen density estimator,34 first applied to metabonomic data using probabilistic neural networks.18,35 Each reduced and scaled NMR spectrum is considered as a point xi in an M dimensional space, where M is the number of variables describing each sample. Classes are represented by probability densities, constructed by summing Gaussian densities centered at each data point. The relative probability of a new data object x belonging to class A is: Journal of Proteome Research • Vol. 6, No. 11, 2007 4411

research articles pA(x|xi∈A) )

1 NA(2πσ2)M/2



(

)

-|x - xi|2

exp

i∈A

2σ2

where σ is a parameter controlling the smoothness of the probability estimate and xi∈A denotes the set of spectra for the NA objects in class A. We can extend the basic CLOUDS model by using this probability estimate to define a measure of similarity between groups of data points. The similarity measure naturally emerges by considering the overlap integral, ∫pApBdx, of the probability densities of two classes A and B. Because the kernel densities are Gaussian, this integral is straightforward to evaluate:

OAB )

1 NANB(4πσ2)M/2

∑∑

exp

i∈A j∈B

(

)

-|xi - xj|2 4σ2

We note that this is equivalent to the mean probability density of all data points in class B according to the density estimate for class A, but with the smoothing parameter increased by a factor of x2. Finally, we normalize the overlap integral, defining the overlap similarity between classes A and B: OAB

SAB )

xOAAOBB

This measure varies between zero, when there is no overlap between clouds, and one when the clouds are identical. Variables with missing values (due to DRC interference) are excluded from the calculation of |xi - xj|. The smoothing parameter σ can be chosen according to the application. In this case we used the value of σ which maximized the structure of the similarity matrix as measured by its Shannon entropy: E(σ) ) -



1

0

p(S)log p(S)dS

where p(S) is the probability density of overlap similarity S. We approximated the integral by summing over a normalized histogram of the values from the lower triangular part of the similarity matrix: nbin

E(σ) ≈ -

∑ p log (p ), where i

i

i)1

pi )

2

∑ φ (S n(n - 1) i

j>k

jk),

φi(S) )

{

1 if S is in bin i 0 otherwise

and n is the number of classes. Figure 3 illustrates the principle. For example, when σ is too low, there is little overlap between any classes, all overlap similarities are near zero and the entropy is low. Alternatively, when σ is too high, all classes overlap greatly, the entire matrix has values close to one and entropy is again low. In between there is an optimal σopt which maximizes the entropy (in this case σopt ) 25) and this was used in all CLOUDS modeling procedures. We note that this procedure does not optimize the error rate of the classifier and therefore does not bias the estimation of the classification performance. Test for Spectral Interference. When comparing two groups of spectra affected by DRC interference, the corresponding regions were treated as missing values and it was important 4412

Journal of Proteome Research • Vol. 6, No. 11, 2007

Ebbels et al.

that the remaining regions provided meaningful information on the dose response. For example, two treatments may incorrectly appear similar because most real signals have been removed, leaving only regions of noise or nonspecific signal. We developed a test for the effect this interference would have on the overlap similarities (see Figure 4) that flagged any pair of treatments adversely affected in such a way. The data from two treatments (say T1 and T2) each have an associated set of excluded regions (say E1 and E2) due to their respective DRCs. These are combined when the two treatments are compared, excluding regions from both sets (i.e., the union of the two sets U12 ) E1∪E2). If the comparison is valid, and the remaining regions are useful for defining the dose response, then the similarity to controls should not be sensitive to the set of exclusion regions used. We employed the following method to decide if the similarity between any given pair of treatments was unreliable. 1. Compute the similarity of treatment 1 to controls using E1 exclusion set E1, SC1 . 2. Compute the similarity of treatment 1 to controls using U12 exclusion set U12, SC1 . E2 U12 and SC2 . 3. Repeat steps 1 and 2 for treatment 2 to give SC2 E1 U12 E2 U12 4. If either |SC1 - SC1 | or |SC2 - SC2 | is greater than a predetermined threshold (0.2 in this analysis) then mark the similarity S12 as unreliable. 5. Repeat steps 1-4 for each pair of treatments. We note that, although DRC interference may be identified a priori by inspection of the spectra, its effect on the expert system can be quite minimal if the affected spectral regions are not important in the classification. For example, although S12 might be unreliable, S13 might be unaffected. Therefore, we adopted the above post hoc validity check, rather than removing all such treatments at the beginning of the analysis. Classification of Treatments and Testing for Significance. The toxicity types were classified using histopathological criteria to assign to each of the 62 treatments an organ of main effect (36 liver, 17 kidney, and 9 other/multiple organ treatments). Treatments were then sequentially removed from the dataset and classified using a simple set of logical rules depicted in Figure 5. The three most similar treatments to the test treatment were identified (the top three “hits”) and their levels of similarity, S, allocated to one of four levels: Level 1: S < 0.1 Level 2: 0.1 e S < 0.6 Level 3: 0.6 e S < 0.9 Level 4: 0.9 e S These levels were selected so as to separate very high, very low, and intermediate similarities from each other and the results were found to be insensitive to changes ((0.1) in the thresholds. The flow chart was then followed to classify the treatment to liver or kidney toxicity. Some cases resulted in no prediction being possible. Table 2 gives the details of the top three hits and the final classification for each of the treatments. The above process of leave-one-out cross-validation simulates the real situation in which such an expert system might be used to determine the toxicity profile of a novel compound. In practice, the similarity scores across all treatments in the database would be interpreted by toxicologists in the light of background literature and knowledge of the compound structure. In this analysis however, we employed the above strategy in order to assess the performance of the expert system as a liver/kidney toxicity classifier in an objective way.

Toxicity Prediction with Temporal Metabolic Data

research articles

Figure 3. Determination of the CLOUDS smoothing parameter, σ. The figure illustrates for 12 treatments how the structure in the similarity matrix, as measured by its entropy, changes for different values of the σ. Similarity matrices corresponding to three values of σ are shown, with similarity on a scale of 0 to 1 indicated by the color bar. The maximum structure is attained at the maximum of the entropy function.

The significance of the classification success was analyzed by randomly permuting each row of the similarity matrix. By permuting the similarity scores, the top 3 hits used to classify each treatment are randomly sampled (without replacement) from all the treatments in the database. This corresponds to a null hypothesis in which there is no relationship between metabolic profiles and toxicity. All rows were randomly permuted 10 000 times and each time the number of correct and incorrect predictions was logged. The percentile rank (and corresponding p-value) of the actual results with respect to the distribution of permuted results was computed.

Results Model of Normal Rat Urine. The first stage of the expert system approach is to determine if urinary metabolites are significantly perturbed by the test treatment using the PCAbased model of normal urine. The scores plot of the control data for the first three components (Figure 2c) shows an approximately spherical cloud of data points, whereas the model residuals (not shown) showed no significant outliers. Of the 4963 postdose samples from dosed animals, 2451 (49%) fell outside the normal model and were therefore classified as “abnormal”. Note that many samples from dosed animals are expected to appear normal because they derive from time points before the onset or after the recovery from the toxic episode (see Figure 2d). In all, 88% of dosed animals exhibited at least one abnormal postdose sample based on the PCA model. Despite the lack of detailed biochemical information obtained through standard histopathology and clinical chemistry and the consequent difficulty of their use in calibrating

molecular techniques, one may ask to what extent the normal model reflects conclusions drawn from conventional methodologies. Histopathology and clinical chemistry confirmed that the desired toxic or other physiological perturbation was seen for the majority of treatments (see Table 2). Of the 80 treatments performed, 37 showed main effects in the liver and 17 in the kidney; 12 treatments were designated as showing “other” effects due to multiple organ toxicity (n ) 8), a single organ of main effect other than kidney or liver (n ) 2), or nontoxic physiological effects (n ) 2). For 14 treatments, no relevant histopathology or clinical chemistry effects were observed. Table 1 shows a good agreement (67 out of 80 treatments) between the two methods as to the presence or absence of an effect (p ) 0.007, Fisher’s exact test for contingency tables). Sixty-two of the 80 treatments showed an effect according to the normal model and were thus used in the subsequent similarity analysis. The 13 cases where there were differences between the methods may be attributed to: (a) treatments causing purely biochemical or physiological effects without structural lesions and therefore not detected by histopathology or clinical chemistry (n ) 4), (b) treatments exhibiting effects of insufficient magnitude to cause detectable histopathological lesions, but detectable by urinary metabonomics (n ) 5; in both a and b, the metabonomic approach is more sensitive to treatment effects than histopathology), or (c) treatments exhibiting low and inconsistent metabolic effects and thus falling below the 50% effect criterion (n ) 4). Of the latter four treatments, two exhibited a maximum metabolic response of exactly half the animals, thus narrowly failing the 50% criterion (“no effect”). No such “borderline” effects were observed in the rest of the database. The remaining two treatments Journal of Proteome Research • Vol. 6, No. 11, 2007 4413

research articles

Ebbels et al.

Figure 4. Test for spectral interference. (a) Data matrices from two treatments T1 and T2 have excluded regions (light blue columns E1 and E2). Controls have no excluded regions. The distribution of data points in metabolic space is shown schematically to the right in each panel. (b) When T1 and T2 are compared, their exclusion regions are combined (U12). To assess the effect of the combination of exclusion regions each treatment is compared to controls both with its own (c) and the combined (d) exclusion sets. If the change in similarity to controls between (c) and (d) was above a predetermined threshold (0.2), the corresponding similarity was marked as unreliable, and the similarity for T1 and T2 was said to have failed the test.

exhibited slight metabolic responses (below 50%), consistent with the observed histopathological effects, which were all of “minimal” or “slight” severity. These observations indicate that the agreement between the metabolic normal model and histopathological responses is good and may be even stronger than suggested by the table. CLOUDS Overlap Similarities. Abnormal samples from time points showing consistent metabonomic and histopathological effects were used to calculate a CLOUDS overlap similarity between each pair of treatments, forming a 62 by 62 symmetric similarity matrix (Figure 6). In the figure, the rows and columns of the matrix are ordered by a hierarchical clustering algorithm using single linkage. In the clustering procedure, each branch of the dendrogram may be rotated without affecting the hierarchical structure. By rotating branches, it was possible to separate treatments affecting the liver from all others, illustrating the substantial degree of information relating to toxicity inherent in the data (see Supporting Information). Each element in the matrix compares the total metabolic profile of one treatment with another, across all NMR detectable metabolites, affected animals and time points, giving an indication of their toxin likeness as determined by urinary metabonomics. A number of interesting features are seen in the clustered similarity matrix and are identified by numbered blocks in Figure 6. Block 1 contains the group of 7 replicate studies of hydrazine, a model liver toxin which also has effects in the kidney. These studies clearly show very high similarity to each other, indicating the high degree of reproducibility of the 4414

Journal of Proteome Research • Vol. 6, No. 11, 2007

metabonomic profiles, as previously demonstrated.14 Interestingly, one of these studies (row 12 of the matrix) was conducted using a different strain of rat to all other studies (Han-Wistar instead of Sprague-Dawley). The fact that this study appears so similar to the other six shows that the identified patterns are not excessively strain-specific, suggesting that the model might be more widely applicable among different rat strains. The eighth hydrazine study (row 4) does not group with this block, though it shows a mild similarity to the block members. This is because animals were euthanized prematurely and thus the data only cover half the time course covered by the other studies. This example shows the importance of matched time course studies in capturing the full trajectory of metabolic response to toxic insult. Block 2 contains two studies involving treatment with the liver toxin acetaminophen, one as a single acute dose (row 18) and one as repeated doses over a period of 7 days (row 17). These studies form the most similar neighbors to each other in the similarity matrix showing that, although the dosing regimes were different, the time-integrated metabonomic profiles were highly alike. This is an important observation, as it indicates that the biomarkers in single dose and multidose studies are effectively the same for this compound. Turning to the treatments with extrahepatic main effects, block 3 clusters the two treatments affecting the endocrine system, dexamethazone (row 46) and streptozotocin (row 47). Both these compounds are used as models of Diabetes Mellitus and both showed significantly increased urinary glucose levels

research articles

Toxicity Prediction with Temporal Metabolic Data

Figure 5. Logical rules used to classify each treatment, based on the classes of the top three “hits”. The similarity levels were (1) less than 0.1, (2) 0.1-0.6, (3) 0.6-0.9, and (4) greater than 0.9. Table 1. Number of Treatments Showing an Abnormal Effect as Determined by Tissue Histopathology and the Model of Normal Urinea model of normal urine histopathology result

effect

no effect

effect no effect

62 9

4 5

a Histopathology results were deemed to show an “effect” if a majority of animals showed toxicity-related changes, whereas the urinary model indicated an “effect” if more than 50% of the animals fell outside the model boundary (see Methods).

after 8h post dose, among other effects. Blocks 4 and 5 contain the best examples of renal tubular and papillary toxins respectively from the database. Included in block 4 are the treatments cisplatin, mercury II chloride, hexachlorobutadiene and aurothiomalate (rows 51-54 respectively), all of which showed significant tubular necrosis in histopathology examinations, whereas block 5 brings together the only two papillary toxins in the database, 2-bromoethanamine and 2-chloroethanamine (rows 56 and 57). Two further examples of close similarity in the matrix are the two treatments inducing metabolic acidosis, ammonium chloride and acetazoleamide (rows 43 and 44). In

the center of the matrix, a large block of similarity is observed that connects both kidney and liver treatments. This indicates that some of the patterns observed in the metabolic profiles are probably not organ specific but may relate to more general processes, for example perturbations to energy metabolism. There are numerous other examples of similarity structures that correlate with treatment effects in the matrix; however, a detailed discussion will follow in a later publication. The fact that the clustered similarity matrix can be segregated by target organ is promising, indicating that substantial information relating to the site of toxicity is latent in the matrix. We therefore developed a simple “expert system” using the similarities to predict the organ of main effect, between liver and kidney, for all 62 treatments. The COMET predictive screening system was tested using two blind treatments where compounds were given at sub-toxic doses, and by crossvalidation across the whole treatment database, showing that the approach is highly effective in classification, even in the absence of overt toxicity. Full results for each treatment are given in Table 2. Performance Testing with Blind Treatments. Two additional animal studies were conducted, using treatments (X and Y) selected from the dosed training set as a blind test of Journal of Proteome Research • Vol. 6, No. 11, 2007 4415

research articles

Ebbels et al.

Figure 6. Similarity matrix for 62 treatments sorted by hierarchical clustering (single linkage). Black elements represent pairs of treatments whose similarities are unreliable because of spectral interference from DRCs. Exemplar similarity blocks 1-5 are discussed in the text.

the model. The treatments were administered at approximately 25 and 50% of the original dose levels respectively to provide a robust practical test of the predictive power of the expert system. In both cases, no significant histological or clinical chemical abnormalities were found indicating that a marginally toxic dose was given. Both studies showed significant metabolic effects and their CLOUDS overlap similarities to the other 62 treatments were computed. The top matches were: treatment X - azathioprine, a centrilobular liver toxin (similarity 0.85) and treatment Y - maleic acid, a kidney tubular toxin (similarity 0.61). After the analysis was complete, the true identities of the subtoxic treatments were revealed; treatment X was dichloroethylene, a centrilobular liver toxin, and treatment Y was the kidney tubular toxin maleic acid. Both treatments X and Y had close second and third hits from different toxin classes, making a confident prediction of their toxicity type difficult. However, it is highly encouraging that the top hit corresponded to a correct toxicity match in both cases, despite the significantly lower doses used in both treatments and negligible standard toxicity test results. This suggests that the metabonomic screen had higher sensitivity than conventional methods for these compounds and that it was site specific in its predictions. Performance of the Expert System in Relation to Histopathological Criteria. Classical toxicology is dominated by histopathological assessment. Therefore, we endeavored to 4416

Journal of Proteome Research • Vol. 6, No. 11, 2007

compare our metabonomic approach to classifications using histopathological criteria. Thus, the overall performance of the system across the 62 treatments was assessed by comparing the predictions for each compound with the toxicity type assigned by histopathology in a leave-one-out cross validation procedure. Each treatment was removed in turn and its toxicity type predicted from the CLOUDS overlap similarities to the remaining treatments. We note that this is a more stringent test than the blind studies because the classification must be made on data from different treatments. For example, in the blind treatment case, a replicate maleic acid study (albeit with an altered dose level) was available in the training set and matched the test treatment. However, in the cross-validation approach, maleic acid had to be classified without any samples from a maleic acid study in the training set. Table 3 gives the overall results, again showing a significant association between the observed and predicted classifications (p ) 7 × 10-5, Freeman-Halton extension of Fisher’s exact test for contingency tables). A prediction was possible for 61% (38/ 62) of the treatments and for 31 of these (82%), the results concurred exactly with the class determined by histopathology; these were deemed “correct” predictions. The predictions differed from those of histopathology in three cases (8%), all kidney toxins classified as liver. Nine treatments whose histopathology indicated they affected other or multiple organs were

Dimethylformamide

Ethionine Hydrazine-HCl 7

Aflatoxin

Clofibrate

Allyl Alcohol

Dichlorobenzene Hydrazine-HCl 5

Hydrazine-HCl 3

Hydrazine-HCl 6

Hydrazine-HCl (Wistar) Hydrazine-HCl 4

Hydrazine-HCl 2

Hydrazine-HCl 1

Dichloroethylene

2

3 4

5

6

7

8 9

10

11

12

14

15

16

0.9% saline 0.9% saline 0.9% saline

p.o.

90m p.o. p.o. p.o. p.o.

90m

90m

90m

Microcystin-LR Butylated hydroxytoluene Diethylhexylphthalate

Sodium Valproate

R-naphthyl isothiocyanate Rotenone

Methapyrilene-HCl

Bromobenzene Indomethacin

N-methylformamide

19 20

22

23

24

25

26 27

28

21

1600 p.o.

Acetaminophen 2h

18

p.o

p.o.

1000 i.p.

1500 p.o. 25 p.o.

200m

100 p.o.

125 p.o.

2000m

20000 p.o.

0.08 i.p. 1000 p.o.

800 p.o.

Acetaminophen

17

Liver: hepititis, degeneration, necrosis, fatty change (5/5) Liver: hepatocellular vacolation, hepatic enzyme elevation (4/5) Liver: minimal regeneration

Liver: single cell necrosis (2/5), liver degeneration, fatty change (5/5) Liver: fatty change, degeneration (4/4)

Liver: fatty change, steatosis (5/5)

Liver: necrosis, glycogen depletion (4/5)

Liver: centrilobular necrosis (3/5)

time points showing effect (h)j

liver 24, 48 liver 24, 48, 72

liver 8, 24, 32, 48, 56, 72, 80, 96, 104, 120, 168 liver 8, 24, 48, 72, 96

liver 24 liver 8, 24, 48, 72, 96, 120, 144, 168 liver 8, 24, 48, 72, 96, 120 liver 8, 24, 32, 48, 72, 96, 120, 144, 168 liver 8, 24, 48, 72, 96, 120, 168 liver 8, 24, 48, 72, 96, 120, 144 liver 8, 24, 48, 72, 96, 120, 144, 168 liver 8, 24, 48, 72, 96, 120, 144, 168 liver 24

liver 8, 24

liver 8, 24, 48

liver 48

liver 8, 24 liver 24

liver 48

liver 8

true class

Liver: decreased glycogen content (5/5) liver 8, 24, 48, 72, 96, 120, 144 distilled water Liver: microvesicular steatosis (5/5) liver 8, 24, 48, 72, 96, 144 corn oil Liver: biliary necrosis (5/5) liver 8, 24, 48, 72, 96, 120, 144 corn oil Liver: hepatocellular apoptosis liver 8, 24, 48, 72, and degeneration (4/5) 96, 120 distilled water Liver: periportal necrosis (5/5) liver 8, 24, 48, 72, 120 corn oil Liver: centrilobular necrosis (2/5) liver 24, 48, 72, 96 0.2% carboxyLiver: hepatocellular necrosis (5/5) liver 24, 48, 72, 96, methylcellulose 120, 144, 168 0.9% saline Liver: subcapsular and centrilobular liver 8, 24, 48, 72 necrosis (3/5)

distilled water

0.2% carboxyLiver: centrilobular necrosis (5/5) methylcellulose 0.9% saline Liver: centrilobular necrosis (5/5) corn oil Liver: centrilobular necrosis (3/5)

0.2% carboxymethylcellulose

corn oil

0.9% saline

0.9% saline

0.9% saline

90m

150 p.o.

major histopathology and clinical chemistry findingsb

Liver: single cell centrilobular necrosis (5/5) 0.9% saline Liver: centrilobular necrosis (5/5) 0.9% saline Liver: single cell necrosis (5/5) 0.9% saline Liver: vacuolation, decreased glycogen (4/5) 1% carboxyLiver: hepatocellular methylcellulose necrosis (5/5) 0.25% Liver: panlobular peroxisome methylcellulose proliferation (4/5) 0.9% saline Liver: periportal and vascular necrosis (4/5) corn oil Liver: centrilobular necrosis (5/5) 0.9% saline Liver: necrosis, fatty change (5/5)

distilled water

vehicle

90m p.o.

400 p.o. 90m p.o.

120 p.o.

1000 p.o.

2 p.o.

800 p.o. 90m p.o.

1000 i.p.

160 p.o.

1h

13

Monocrotaline

treatment

1

no.

dose dose (mg/kg) route

Table 2. Detailed Results for the 80 Treatments and Two Blind Studies Used in the Analysisa

30

27 45

25

38

44

40

49

15 24

36

98

9

53

43

45

48

65

45

9 46

19

18

10

20 10

7

10

class

liver

class

0.55 liver

0.2

hit 2

0.49 other

0.42 liver

0.54 liver

0.31 liver

0.17 liver

0.65 liver 0.26 liver

0.65 other

0.55 liver

0.47 liver

0.86 liver

0.85 liver 0.86 liver

0.79 liver

0.77 liver

0.77 liver

0.85 liver

0.74 liver

0.7 liver 0.71 liver

0.66 liver

0.62 liver

class

OK

OK

Fail OK

OK

Fail

OK

0.63 liver 0.67 liver

0.61 liver

other

0.84 liver 0.84 liver

0.77 liver

0.76 liver

0.72 liver

liver liver no pred

-

liver

liver

liver

no pred

liver

liver liver

liver

OK OK

OK

OK

OK

0.73 kidney Fail

OK

OK OK

OK

no pred

liver

no pred

OK

liver

liver

no pred

liver

liver

no pred liver

liver

no pred

liver

no pred no pred

no pred

no pred

final class

OK

OK

0.48 kidney Fail

0.59 liver

0.62 liver

0.48 liver

0.41 liver

0.48 kidney -

0.31 other

0.13 liver

0.63 liver 0.26 liver

0.54 liver

0.54 liver

0.46 liver

0.83 kidney 0.8

0.84 liver 0.85 liver

0.79 other

0.77 liver

0.75 other

0.75 liver

Fail

Fail

DRC interferenceg

0.34 kidney Fail 0.04 liver -

0.12 other

0.02 liver

hit 3

0.73 kidney 0.73 liver

0.68 liver 0.67 liver

0.65 other

0.57 other

0.64 kidney 0.61 liver

0.63 kidney 0.62 other

0.53 liver

0.42 liver

0.55 other

0.35 liver

0.18 liver

0.75 liver 0.27 liver

0.75 liver

0.65 liver

0.51 liver

0.41 kidney 0.38 liver 0.08 liver 0.06 liver

0.76 liver

0.23 liver

no. samplesj hit 1

top 3 hits

Toxicity Prediction with Temporal Metabolic Data

research articles

Journal of Proteome Research • Vol. 6, No. 11, 2007 4417

4418

Journal of Proteome Research • Vol. 6, No. 11, 2007 i.v. p.o. i.p. p.o.

10m

75m

160m

500

37 Adriamycin-HCl

38 Cadmium chloride

39 3,5-dichloroanilineHCl 40 Chloroform

p.o. i.p.

200

100m

53 Hexachlorobutadiene

55 Sodium bicarbonate

0.35Me p.o.

i.p.

52 Mercury chloride

Sodium aurothiomalate

i.p.

5

51 Cisplatin

p.o. i.p.

i.p.

i.p.

2m

3 0.2

350

60

49 Ethylene Glycol 50 Caerulein

48 Folic Acid

47 Streptozotocin

46 Dexamethasone

44 Acetazoleamide 45 Food restriction

41 1-cyano-2-hydroxy3-butene 42 4-amino-2,6dichlorophenol 43 Ammonium chloride

distilled water

0.9% NaCl

corn oil

0.9% saline

10 mM citrate buffer 150 mM sodium bicarbonate 0.9% saline 0.05M NH4OH in 0.9% saline 0.9% saline

corn oil

0.9% saline

distilled water

0.9% saline

distilled water

0.9% saline 10% acacia + 0.1% antifoam 0.9% saline

distilled water 0.9% NaCl

corn oil none

vehicle

liver

liver

liver liver

liver liver

liver liver

true class

Kidney: metabolic alkylosis increased urinary pH, reduction in blood CO2 (5/5)

Kidney: tubular necrosis (5/5)

Kidney: tubular necrosis (5/5)

Kidney: tubular basophilia (4/5) Pancreas: subacute pancreatisis (5/5) Kidney: tubular, cortex, medullary necrosis (4/5) Kidney: tubular necrosis (5/5)

Pancreas: endocrine islet cell degeneration (5/5) Kidney: tubular necrosis (4/5)

kidney 72, 96, 120, 144, 168 kidney 8, 24, 48, 72, 96, 120, 144, 168 kidney 8, 24, 48, 72, 96, 120, 144 kidney 8, 24, 48, 72, 96, 120, 144, 168 kidney 24, 32, 48, 56, 72, 80, 96, 120, 144

kidney 8, 24 other 8

8, 24, 48, 72, 96, 120, 144, 168 other 8, 24, 48, 72, 96, 120, 144, 168 kidney 24, 48

other

86

kidney 24, 32, 48, 56, 72, 80, 96, 120, 144 kidney 8, 24 other 120, 144, 168

177

46

47

47

25

20 10

20

52

55

15 10

39

kidney 8, 24, 48, 72, 96

27

10

19

41

29

37

14 35

10 62

24 29

class

liver

other

liver

class

0.87 liver 0.87 other

other

0.92 liver

0.92 other

0.9

0.82 liver

0.83 liver

OK

OK

OK

OK

OK

OK

OK

OK

other

0.74 other

0.55 liver

0.74 liver

OK

OK

OK

0.73 liver 0.72 kidney Fail 0.71 kidney 0.71 other OK

0.76 other

0.6

0.82 other

0.83 kidney 0.82 liver -

0.83 kidney 0.82 kidney OK

0.83 liver

0.83 liver

0.89 kidney 0.86 liver

0.87 liver

kidney 0.87 liver 0.89 other

0.9

0.92 liver

0.92 liver

0.93 other

OK OK

OK OK

OK OK

DRC interferenceg

0.67 liver

0.61 liver

0.61 liver

OK

0.86 kidney 0.83 kidney 0.78 kidney OK

0.94 kidney 0.89 kidney 0.83 kidney OK

0.94 kidney 0.87 kidney 0.86 kidney OK

0.89 kidney 0.87 kidney 0.78 kidney OK

0.81 liver 0.71 other

0.77 liver

0.85 other

0.85 other

class

0.81 liver 0.86 liver

hit 3

0.87 kidney 0.85 liver 0.92 liver 0.91 liver

0.88 other 0.89 liver

0.81 other 0.89 liver

hit 2

top 3 hits

0.86 other 0.83 liver 0.86 kidney 0.82 other

0.83 other

0.84 liver

0.87 liver

0.9

0.9

0.9

0.93 liver

0.94 liver

0.94 liver

0.92 liver 0.92 other

0.91 liver 0.92 liver

0.87 liver 0.89 other

no. samplesj hit 1

25

8, 24, 48

24

24, 48, 72, 96, 120, 144, 168 8, 24

24, 48 48, 72, 96, 120, 144, 168 8, 24, 48, 72, 96 8, 24, 48

48 8, 24, 48, 72

24, 48, 72 8, 24, 48, 72

time points showing effect (h)j

8, 24, 96

other

other

Liver: decreased glycogen content other (4/5). Kidney: glomerulopathy (5/5) Testes: tubular bilateral necrosis (3/5) other Liver: hepatic enzyme elevation (4/5) Kidney: single cell necrosis (4/5) kidney

Liver: centrilobular fibrosis (2/5)

Liver: apoptosis and necrosis (5/5)

Liver: inlateral haemorrhage (1/5) Liver: decreased glycogen content (3/5) Liver: parenchymal necrosis (4/5) Liver: necrosis (2/5)

Liver: centrilobular necrosis (5/5) -

major histopathology and clinical chemistry findingsb

Liver: steatosis, necrosis (4/5) Kidney: regeneration (5/5) 150 i.p. 0.9% saline Kidney: necrosis. Pancreas: atrophy Intestine: distended (4/5) 70 i.p. 2% DMSO in Kidney: tubular medullary coil oil proteinosis (4/5) 0.28Me p.o. 0.9% saline Kidney: metabolic acidosis decrease in urinary pH, adaptive changes to renal epithelial cells (5/5) 600 p.o. methylcellulose Kidney: papillary necrosis (1/5) 50% and diet none Liver: decreased glycogen 100%f modificontent (5/5) cation 100 i.p. 10% DMSO in Liver: vacuolation. Spleen: atrophy corn oil Thymus: necrosis (5/5)

i.p.

114m

36 Lead acetate

i.v.

i.p. p.o.

p.o. surgical study p.o. i.p.

dose route

5

35 Lipopolysaccharide

60 300

80 20

31 Dimethylnitrosamine 32 1-fluoropentane

33 Chlorpromazine 34 Azathioprime

3200 -

treatment

dose (mg/kg)

29 Carbon tetrachloride 30 Partial hepatectomyc

no.

Table 2 (Continued)

liver

kidney

kidney

kidney

kidney

no pred no pred

no pred

no pred

no pred

no pred no pred

kidney

liver

liver

liver

no pred

liver

liver

liver

no pred

liver liver

liver liver

liver liver

final class

research articles Ebbels et al.

Vancomycin-HCl

61

4

Mitomycin-Ci

p.o. p.o.

p.o.

i.p. i.p.

i.p.

p.o.

i.p. p.o. p.o. p.o.

i.p. diet modification

i.p. p.o.

p.o.

p.o. i.p. i.v. i.p.

i.p.

i.v.

i.p. i.p. p.o. surgical study s.c.

dose route

Kidney: papillary necrosis (5/5) Kidney: tubular necrosis (2/5) Kidney: tubular necrosis (5/5) -

kidney kidney kidney kidney

true class

8, 24 8, 24, 48 8, 24, 48 8

time points showing effect (h)j

5 12 0 0 0 0 0 0

120 48

corn oil distilled water

corn oil

0.9% saline 0.9% saline

0.9% saline

0.9% saline 0.9% saline 0.9% saline 10% acacia + 0.1% antifoam 0.2% carboxymethylcellulose

-

-

other

other -

liver 8, 24 kidney 8

Bone marrow: apoptosis. Intestine: necrosis. Liver: regeneration (4/5) Intestine, thymus, spleen, bone marrow: damage to DNA and rapidly dividing cells Kidney: tubular necrosis no lesionsk Liver: centrilobular hepatic necrosis Kidney: tubular necrosis (4/5) no lesionsk no lesionsk no lesionsk

Liver: periportal necrosis (3/5) no lesionsk no lesionsk no lesionsk other

liver -

15 9

0

0 0

7 10

27

-

no lesionsk no lesionsk

8 48

no lesionsk

10% acacia + 0.1% antifoam 0.9% saline 10% acacia + 0.1% antifoam corn oil none

10 10 9 25

-

no no lesionsk no lesionsk no lesionsk

distilled water 0.9% saline 0.9% saline 0.9% saline

no lesionsk no lesionsk

lesionsk 8 8 8 8, 72, 96, 120, 144, 168 8, 24, 48

6

10

16

20 30 25 10 liver

kidney kidney liver liver

class

-

-

-

-

-

-

-

-

-

liver liver kidney other

class

0.4

0.64 0.51 0.55 0.52

hit 3

liver

liver liver kidney liver

class

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

0.23 other

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

-

0.17 liver

0.24 kidney 0.23 other

0.44 liver

0.65 0.57 0.56 0.57

hit 2

0.85 liver 0.82 kidney 0.82 other 0.61 kidney 0.58 liver 0.47 liver

-

-

-

-

-

-

-

-

-

0.27 liver

0.24 other

0.6

0.65 0.65 0.59 0.59

no. samplesj hit 1

-

0.3% rat album Blood: low glucose other 8, 120, 168 in +0.9% NaCl 0.9% saline Kidney: tubular regeneration, kidney 8 increased blood urea nitrogen, creatinine, urinary volume (5/5) 0.9% saline Kidney: tubular necrosis (4/5) kidney 8

0.9% saline 0.9% saline distilled water none

vehicle

major histopathology and clinical chemistry findingsb

top 3 hits

OK OK

-

-

-

-

-

-

-

-

-

Fail

OK

Fail

OK OK OK Fail

DRC interferenceg

no pred kidney

-

-

-

-

-

-

-

-

-

no pred

no pred

no pred

liver kidney kidney no pred

final class

a The table is ordered according to the clustered similarity matrix of Figure 6 and divided into five sections: liver toxins, kidney and other toxins, treatments showing metabolic effects but without histopathological/ clinical chemistry effects, treatments showing no metabolic effects, and the blind studies. b Results are quoted for the organ and necropsy group (48 or 168 h) with the most significant effect. The proportion of animals with the effect is given in parentheses. c Partial hepatectomy involves the surgical removal of part of the liver. d Unilateral nephrectomy involves the surgical removal of one kidney. e Sodium bicarbonate and ammonium chloride were dosed ad libitum in drinking water for 5 days. f Food restriction: 50% dose group received half the control group intake for 5 days, 100% group received no food for 24 h. g “OK” indicates that the comparison passed the DRC interference test. A dash indicates that the DRC interference test was not applied: the top three hits could not give a consistent predicted organ. h Acetaminophen 1: Animals were dosed once daily for 7 days in a multiple dosing protocol. Acetaminophen 2 followed the standard single dose protocol. i No significant metabonomic effect: There were no time points where >50% of the animals showed an effect according to the normal model. j Time points and numbers of samples where >50% of the animals showed an effect according to the normal model. Time points are labeled by the last hour of the collection period, e.g., 0-8 h is labeled 8. k No lesions relevant to toxicity were detected. l A significant metabonomic effect was detected, though no relevant histopathology lesions were found. m Dose refers to the salt. In all other cases, the dose refers to the active moiety.

40 200

400

Methotrexatei

Blind X Dichloroethylene Blind Y Maleic acid

75 50 350 100

Allyl formatei Furosamidei Ketaconazolei Lithocholic acidi

250

1000 48 h

Trichloroethylenel Water deprivationl

WY14,643i

15m 500

Paraquat-HCll Retinyl palmitatel

100 100

100

Ifosfamidel

Phenobarbitali Thioacetamidei

15 889 10m 150

900

750 UI/kg 200m

150 1000 400 -

dose (mg/kg)

2,4 Buthionine sulphoxidel Gadolinium Chloridel Gentamycinl

Dinitrophenoll

Cephaloridine

Insulin

60

62

2-bromoethanamine 2-Chloroethanamine Maleic acid Unilateral nephrectomyd

treatment

56 57 58 59

no.

Table 2 (Continued)

Toxicity Prediction with Temporal Metabolic Data

research articles

Journal of Proteome Research • Vol. 6, No. 11, 2007 4419

research articles

Ebbels et al.

Table 3. Overall Classification Results for 62 Treatmentsa metabolic prediction histopathology class

liver

kidney

no prediction

total

liver kidney other/multiple organ total

24 3 4 31

0 7 0 7

12 7 5 24

36 17 9 62

a

This table does not represent a traditional “confusion matrix” because the third column and row are not equivalent.

also presented to the system. Of these, no prediction could be made for five treatments, whereas the remaining four were all classified as liver toxins. Three of the four did indeed show histopathological effects in the liver, among other effects (see Table 2). Of the five unclassifiable other/multiple organ treatments, three corresponded to treatments in which neither kidney nor liver effects were seen by histopathology, thus reflecting the conservative judgment of the system. Further insight into system performance can be obtained by analyzing the sensitivity and specificity for each organ class. Here, we define sensitivity as the proportion of all treatments affecting a given organ that are classified to the same organ. Similarly, we define specificity as the proportion of all treatments predicted to affect a given organ that truly affect that organ. The system has sensitivities to liver and kidney toxins of 67 and 41% respectively, whereas the corresponding specificities are 77 and 100%. Including the three cases where liver toxicity was detected when mixed with other organ effects raises the liver specificity to 87%. Although the numbers of treatments predicted to each toxin type are small (24 liver and 7 kidney), the specificities indicate that when a toxicity type is predicted there is a high probability that this is correct. The relatively low sensitivity to kidney toxins may reflect the highly diverse range of nephrotoxic sites and mechanisms and thus relatively low coverage in the database. Several reasons can be identified for the 24 treatments for which no prediction could be made. For 7 of the treatments, the best matching treatments could not be assigned to the liver or kidney by histopathology. For a further 11, the best matching treatments suffered too severely from spectral interference due to DRCs. In addition, there were 5 treatments where there was no majority vote between liver or kidney toxicity from the three best matches. Finally, in one case, none of the three best matches was above the minimum similarity threshold. The three incorrect classifications could be attributed to possible low-level multiple organ effects, pH-induced shifts in NMR resonances, and the lack of sufficiently similar treatments in the database for some kidney toxins. These results appear highly promising. However, we have also investigated if such classification rates could have been obtained by chance. It is instructive to consider, as an example, the performance of a naı¨ve classifier that predicts all treatments to be members of the liver class. This approach, though correctly classifying 58% (36/62) of treatments, misclassifies all 17 kidney toxins resulting in an error rate of 27% (17/62). This illustrates the importance of both a high success rate and a low error rate in evaluating a classification system. It also highlights the value of the “no prediction” option in our classifier; in real applications, classifying a liver toxin as affecting the kidney is a more costly mistake than merely making no prediction. As a further check to rigorously determine the statistical significance of our results, we performed a 4420

Journal of Proteome Research • Vol. 6, No. 11, 2007

permutation test. In 10 000 random permutations of the similarity matrix, the actual number correct (31) and the actual number incorrect (3) were significant at p-values of 0.008 and