Novel Method for Calculating a Nonsubjective Informative Prior for a Bayesian Model in Toxicology Screening: A Theoretical Framework

Michael Woldegebriel*

Analytical Chemistry, Van’t Hoff Institute for Molecular Sciences, University of Amsterdam, P.O. Box 94720, 1090 GE Amsterdam, The Netherlands
ABSTRACT: In toxicology screening (forensic, food safety), several sources of analytical error (e.g., retention time shift, lack of repeatability in m/z scans) make it challenging to confidently identify/confirm a compound. Because of these uncertainties, a probabilistic approach is currently preferred. However, if a probabilistic approach is followed, the only statistical method capable of estimating the probability that the compound of interest (COI) is present/absent in a given sample is Bayesian statistics. Bayes’ theorem can combine prior information (prior probability) with data (likelihood) to give an optimal probability (posterior probability) reflecting the presence/absence of the COI. In this work, a novel method for calculating an informative prior probability for a Bayesian model in targeted toxicology screening is introduced. In contrast to earlier proposals making use of literature citation rates and the prior knowledge of the analyst, this method presents a thorough and nonsubjective approach. The formulation treats the probability calculation as a clustering and random draw problem that incorporates a few analytical method parameters meticulously estimated to reflect the sensitivity and specificity of the system. The practicality of the method has been demonstrated and validated using real data and simulated analytical techniques.
Bayes’ theorem, the core of Bayesian statistics, was developed in the 18th century by the English statistician Thomas Bayes.1 Bayes’ theorem gives the evidence of the true state of a hypothesis as degrees of belief (Bayesian probabilities). The practical application of Bayesian statistics only became prevalent at the end of the 20th century.2 Currently, Bayesian statistics is used in almost all fields of science due to the advancement in computational power of modern computers, which has dramatically changed the simulation power that some statistical techniques, such as Bayesian Markov chain Monte Carlo, require. Bayesian inference has found a wide range of applications, including analytical chemistry.3,4

The ability to make inferential decisions for hypotheses based on data is the concern of all statistical approaches. In the frequentist approach (another school of thought), the definition of probability is the chance of observing the collected or more extreme data given that the hypothesis is true.5,6 This approach does not allow us to answer the real question, “is my hypothesis true?”. In contrast to the frequentist approach, the answer to “is my hypothesis true?” can be given through a Bayesian method. Bayesian statistics approaches the inferential process by combining two sources of information and calculating what is referred to as the posterior probability.7−9 The first source is the data (likelihood), which is the sampling information, and the second source is the prior information possessed about the hypothesis itself. Depending on the research question, formulating the hypothesis using Bayes’ theorem requires a prior belief adapted from either earlier research or a certain level of belief regarding the hypothesis. This prior probability can then be incorporated with the data to give the final optimal probability (posterior probability). In general, the two statistical approaches (frequentist and Bayesian) differ in their attempt to define probability based on physical or evidential information.10

In toxicology screening laboratories, the routine question encountered by analysts is, is/are there any toxic compound(s) in a given sample? To answer this question, several analytical techniques can be used.11−16 The most common techniques are liquid chromatography (LC) and gas chromatography (GC) coupled to mass spectrometry (MS). After obtaining a chromatogram of a given sample, a two-step process of identification and confirmation is required to answer the question.17,18 In most cases, due to the uncertainties involved in the analytical process (e.g., retention time shift, error in mass spectrometry, and software limitations), a probabilistic approach is preferred.19−24 However, if the probabilistic method is followed, the only statistical approach that allows us to answer the actual question of “is/are the compound(s) of interest (COI) present?” is Bayesian statistics.
To fully exploit the benefit of using Bayesian statistics, a robust model for feature extraction, which calculates the first part of Bayes’ theorem (the likelihood), is needed. As long as the model accounts for all possible characteristics of the LC-MS/MS (GC-MS/MS) data and utilizes its multidimensionality, the likelihood can be efficiently estimated. Several studies on probabilistic feature extraction have been published.25−27

The second part of Bayes’ theorem (the prior) is a challenging subject. In Bayes’ formulation, the prior probability is the prior belief the analyst has about a certain toxic compound being present in the sample before actually looking at the data. The prior belief for a certain toxic compound to be present/absent in a given random sample influences the final posterior probability, which reflects the actual presence of the compound in the sample in question. If the prior probability has been estimated correctly, Bayes’ theorem offers a huge advantage, not only in answering the main question efficiently but also in making the automation of the screening process reliable with minimal supervision. An elaborate discussion of the influence of the prior probability in the Bayesian formulation can be found in ref 28. If we attempt to calculate the prior probability through the frequentist definition of probability,29 it will require access to a representative number of trials that can help determine how frequently the COI is present in a random sample. Approaching the prior probability in either way, as evidential (belief) or physical (frequency), requires an approximation of the true probability.

For the Bayesian approach, even though the prior probability plays a significant role in compound identification, few publications have proposed a practical means to determine the probability distribution. One of these proposed approaches starts by looking into the citation rate of a given compound being identified in a certain type of sample by searching the literature.30,31 This approach assumes the general citation of a chemical compound to be related to the prior probability distribution. Similarly, the chemical knowledge of an analyst and the different compositions of natural matrices observed over a long period of time have also been proposed to give a practical and usable prior probability estimate.32,33 However, even though these proposed techniques are somewhat useful, there remains an obvious drawback of either introducing a subjective element of the analyst or not being thorough.

The purpose of this manuscript is to propose a novel solution for the prior dilemma a Bayesian approach could face in compound identification. In contrast to the earlier proposals, this approach introduces a method to estimate a nonsubjective prior that only takes into account method parameters that can be estimated practically. The proposed formulation attempts to exploit the benefits of both the frequentist and Bayesian definitions of probability.
■ THEORY

To illustrate the proposed method, a chromatogram obtained from a random sample at a random toxicology laboratory, analyzed with an LC-MS/MS (GC-MS/MS) system, in an attempt to answer the question of whether the compound of interest (COI) is present/absent in the sample, is assumed. With this assumption, two competing hypotheses can be formulated:

H0: COI is present in the sample
H1: COI is absent in the sample    (1)

P(H0) + P(H1) = 1    (2)
The two hypotheses in eq 1 are mutually exclusive. Therefore, as indicated in eq 2, the probabilities (P) should add up to 1. To proceed further in answering the question, the obtained chromatogram constitutes the data (D), and Bayes’ theorem in its odds form can be applied as follows:

P(H0|D) / P(H1|D) = [P(D|H0) × P(H0)] / [P(D|H1) × P(H1)]    (3)
The formulation in eq 3 is a simplified form of Bayes’ theorem, where P(D) cancels out. P(H0|D) and P(H1|D) are the posterior probabilities indicating the probability of the COI being present or absent, respectively, given the chromatogram as data (D). P(H0) and P(H1) are the prior probabilities of the suspected compound being present (H0) or absent (H1) in a random sample, and P(D|H0)/P(D|H1) is the likelihood ratio reflecting the information in the data. To calculate the likelihood odds, a mathematical model capable of detecting the features of the COI can be applied. As indicated in the previous section, methods to detect the COI within the data have been proposed in several publications. For the prior odds, P(H0)/P(H1), taking the actual question into consideration, if the calculations of P(H0) and P(H1) are approached independently, the option to define the probability using the frequentist method becomes apparent. Calculating the prior probability using the frequentist method makes use of the physical (frequency in large trials), rather than the subjective, probability. Intuitively, approaching the definition of P(H0) this way assumes the availability of a large data set representing the frequency of occurrence (presence/absence) of the COI in the general population of samples. Thus, with this definition, a hypothetical data set A, consisting of a large number of screening results for the COI, can be assumed. In this set, if the screening results are represented with binary notation (0 or 1), indicating negative or positive outcomes for the presence of the COI, then the probability for a random sample to contain the COI can be formulated as follows:

S = {x ∈ A: x = 1}
S̅ = {x ∈ A: x = 0}
n(S) + n(S̅) = n(A)    (4)

P(H0) ≈ n(S) / n(A)    (5)
In eq 4, n(S) and n(A) refer to the number of elements in sets S and A, respectively, where sets S and S̅ are subsets of A containing the elements with positive and negative outcomes for the presence of the COI, respectively. The large data set (A) can thus be used to approximate the true probability of occurrence (presence/absence of the COI) using eq 5. Although it is a useful approximation, the final result is strictly dependent on the total number of cases considered for the calculation. The question of how many samples need to be included is an arbitrary decision. In addition, projecting the calculated probability onto future cases assumes a 100% sensitivity and specificity rate based on the registered frequency, independent of the concentration level of the COI.
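To make eqs 3−5 concrete, the following minimal Python sketch computes the frequentist prior estimate of eq 5 from a hypothetical screening history and then applies the odds form of Bayes’ theorem in eq 3. The counts and the likelihood ratio are invented for illustration only and are not taken from this work.

import numpy as np

# Hypothetical screening history: 1 = COI found, 0 = COI not found (counts assumed)
A = np.array([0] * 1980 + [1] * 20)                      # n(A) = 2000 past samples
prior_H0 = A.sum() / A.size                              # eq 5: P(H0) ≈ n(S)/n(A)
prior_H1 = 1.0 - prior_H0                                # eq 2

likelihood_ratio = 50.0                                  # assumed P(D|H0)/P(D|H1) from a feature-extraction model
posterior_odds = likelihood_ratio * prior_H0 / prior_H1  # eq 3 (odds form)
posterior_H0 = posterior_odds / (1.0 + posterior_odds)   # convert posterior odds to a probability
print(prior_H0, posterior_H0)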
Figure 1. General workflow for calculating the probability of the presence/absence of a compound of interest (COI) in a random sample.
In practice, the practicality of compound identification is limited by the analytical method (i.e., the instrument and software) used for identification/confirmation purposes, regardless of the sample’s true state of nature. To further elaborate this notion, two hypothetical samples, “a” and “b”, can be assumed. If samples “a” and “b” are spiked with the COI below and above the minimum detectable limit (MDL) of the analytical method, respectively, the screening results obtained will lead to the conclusion that sample “a” is a false negative and sample “b” is a true positive. On the other hand, assuming a real case scenario, where we have no a priori knowledge about the content of the samples, regardless of the true state of nature, the results obtained after screening are accepted as the truth. This is a simple indication that, regardless of how many samples are considered for the probability (frequency) calculation using eq 5, the analytical method used fundamentally plays a significant role in determining which samples belong to set S or set S̅. The method proposed here bases its argument on the fact that the frequency of outcomes over a large number of trials (screenings), which can be used for an approximate calculation of P(H0) as indicated in eq 5, is strictly governed by the analytical method’s capability to identify/confirm the COI when actually present. In reality, if the currently available state-of-the-art screening techniques are not capable of identifying the COI at a certain concentration, then there is no means available to prove otherwise. In fact, the final result is always considered the ground truth for an unknown sample. This perspective has an advantage when it comes to calculating P(H0). If the analytical method parameters defining the method’s capability and limitations are known a priori, they can be used to predefine the outcomes for unobserved case scenarios. That way, a rough approximation of the population sample can be made, which can then be adjusted using a large data set consisting of real observed cases. Based on this notion, the formulation presented here approaches the P(H0) calculation as a clustering and random draw problem that makes a thorough assessment of routinely applied screening techniques to reach a nonsubjective prior. In contrast to earlier proposals that attempted to approximate the probability solely based on frequency (e.g., citation rate), this method can make use of not only the large data set available but can also constrain the probability calculation with a few parameters that can be experimentally estimated.

In general, the formulation proposed for calculating P(H0) for the COI can be approached as a two-step process. The first step determines all necessary analytical method parameters, which are used to create an analytically significant concentration level grid (depicted in S1 and S2 in Figure 1), and the second step adjusts the concentration level grid based on a large data set using clustering, after which a random draw approach can be followed to estimate the probability (depicted in S3 and S4 in Figure 1). Details follow further on in the manuscript.

Estimation of Analytical Method Parameters. The first parameter that needs to be estimated is the minimum detectable limit (MDL; refer to Table 1 for definitions).34 The MDL in theory depends not only on the instrument detection limit (IDL) but also on the software detection limit (SDL). Please refer to Table 1 for the definitions of all abbreviations.
Table 1. Abbreviation Definitions

CL: Concentration level(s), singular or plural depending on the context; the quantitative reference of a given compound in a sample.
CLG: Concentration level grid; numerical grid representation of quantities for a given compound of interest.
MIL: Minimum injectable limit; concentration level required for a given COI to give an observable signal above the predefined noise level.
MDL: Minimum detectable limit; concentration level required for the analytical method in question to identify the COI when actually present.
DSL: Detector saturation limit; concentration level required for a given mass spectrometry detector to give a nonlinear response to a concentration.
IDL: Instrument detection limit; minimum concentration level required for a given analytical instrument to detect the presence of the COI.
SDL: Software detection limit; minimum signal intensity required for the algorithm applied to be able to identify the COI.
C*: Concentration variation; minimum concentration level required to create a variation in signal-to-noise ratio with a predefined threshold.
A practical approach to estimate the MDL is to spike different types of biological matrices (e.g., blood and tissue in forensics, food types in food safety) with known concentrations of the COI and to determine the concentration level (CL) needed for positive identification of the COI using a given analytical method. As part of the proposed formulation, a probabilistic (Bayesian) model that calculates the appropriate MDL by taking into account different types of biological matrices, instruments, and software can be formulated. Given the ground truth that the compound has been spiked into the biological sample, the two competing hypotheses can be defined as follows:

H2: COI is observable
H3: COI is not observable    (6)

P(H2) + P(H3) = 1    (7)
The hypotheses defined in eq 6 are mutually exclusive and complementary; the probabilities, as indicated in eq 7, therefore add up to 1. The Bayesian formulation to estimate the probability of the hypothesis can then be presented as follows:

P(H2|D) = [P(D|H2) × P(H2)] / ([P(D|H2) × P(H2)] + [P(D|H3) × P(H3)])    (8)
Figure 2. Graphical representation of the analytical instrument and software combinations necessary for analytical method parameter estimation.
P(D|H2) = Π_{r=1}^{z} Σ_{j=1}^{y} Σ_{i=1}^{x} P(Dr|Ii, Sj, H2) × P(Ii|H2) × P(Sj|H2)    (9)

Eq 8 is straightforward Bayes’ theorem, where P(H2|D) is the posterior probability, P(D|H2) is the likelihood, P(H2) is the prior, [P(D|H2) × P(H2)] + [P(D|H3) × P(H3)] in the denominator is the normalizing factor, and D refers to the data. In eq 9, two new parameters, I and S, referring to the instrument and software types, are introduced into the likelihood of eq 8. This step is necessary to assess the performance of a given combination of analytical methods, taking the IDL and SDL into account (refer to Table 1 for definitions). A schematic of this consideration can be found in Figure 2. The introduced parameters are then marginalized (the Bayesian approach to integrating out nuisance parameters; the parameters are discrete in this case, hence the sums), where x and y refer to the maximum numbers of instruments and software types taken into consideration. The experimental replicates for a given CL are considered independent and are therefore multiplied, with r referring to the index of the data obtained from a specific replicate and z referring to the maximum number of replicates considered. Here it should be noted that, technically, a single replicate of the spiked sample constitutes the data (Dr), which corresponds to the chromatogram obtained by the analytical method in question. The prior probabilities P(H2) and P(H3) in eq 8, in contrast to the priors in eq 3, only refer to whether the spiked COI is observable/nonobservable given the analytical method in question and the ground truth that the COI exists in the sample. Since there is no presumption about the analytical method to be evaluated, a flat prior can be assigned. If, however, there exists prior knowledge for suspecting that it is more/less probable for the COI to be observed/not observed with a given combination of analytical methods (instrument, software), there is the option to reflect that assumption as a probability. This step allows a decision to be made by weighing the information coming from any possible combination of software and instrument. Vendor-supplied instruments and any existing open-source or commercially available software, developed for screening purposes and utilized in different toxicology laboratories, can be considered. P(Dr|Ii, Sj, H2) in eq 9 refers to the probability of the data (D) given instrument Ii, software Sj, and the assumption that hypothesis H2 is true. For this, feature extraction (e.g., peaks, isotopic distribution, and fragment ions) needs to be performed. An algorithm that gives a probabilistic output concerning the presence/absence of the expected features is directly compatible with the formulation presented in eq 8.

The second parameter that needs to be estimated is the minimum injectable limit (MIL; refer to Table 1 for the definition). This parameter takes into account that a signal above noise does not necessarily imply COI identification/confirmation. Here, noise can be defined as any interfering signal emerging from any source other than the COI. To estimate the value of this parameter, the same formulation presented in eqs 6−9 can be used, replacing hypotheses H2 and H3 with H4 and H5, respectively. The hypotheses can be defined as follows:
H4: Signal above noise is observable
H5: Signal above noise is not observable    (10)
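For illustration only, the following Python sketch implements the marginalized likelihood of eq 9 and the posterior of eq 8 for the H2/H3 pair; the same code applies unchanged to the H4/H5 pair of eq 10. The per-replicate likelihood values, the shortcut of taking the H3 likelihoods as complements of the H2 likelihoods, and the flat priors are all illustrative assumptions, not values or choices taken from this work.

import numpy as np

# Hypothetical per-replicate likelihoods P(Dr|Ii, Sj, H2) for z = 2 replicates,
# x = 2 instruments, and y = 3 software packages (values assumed for illustration).
lik_H2 = np.array([[[0.90, 0.75, 0.60],   # replicate 1: rows = instruments, columns = software
                    [0.80, 0.70, 0.55]],
                   [[0.85, 0.72, 0.58],   # replicate 2
                    [0.78, 0.68, 0.50]]])
lik_H3 = 1.0 - lik_H2                     # assumed feature-absence likelihoods (illustrative shortcut)

x, y = lik_H2.shape[1], lik_H2.shape[2]
p_I = np.full(x, 1.0 / x)                 # flat prior over instruments, P(Ii|H)
p_S = np.full(y, 1.0 / y)                 # flat prior over software, P(Sj|H)

def marginal_likelihood(lik):
    # eq 9: product over replicates of the instrument/software-weighted double sum
    per_replicate = np.einsum('rij,i,j->r', lik, p_I, p_S)
    return per_replicate.prod()

prior_H2, prior_H3 = 0.5, 0.5             # flat priors on observability
L2, L3 = marginal_likelihood(lik_H2), marginal_likelihood(lik_H3)
posterior_H2 = L2 * prior_H2 / (L2 * prior_H2 + L3 * prior_H3)    # eq 8
print(posterior_H2)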
Similarly, the detector saturation limit (DSL) and the concentration variation (C*; refer to Table 1 for definitions) can also be estimated by observing the statistical output from the experimental analysis. The estimation of C* might not be straightforward; therefore, an acceptable variation (threshold) in the signal-to-noise ratio can be employed. A graphical illustration of the method parameter extraction can be found in Figure 3.

Clustering Approach: Adjusting the CL Grid Using K-Means Clustering. Once all of the analytical method parameters have been estimated, a concentration level grid (CLG) between 0 (the CL referring to no trace of the COI) and the DSL, using C* as the interval, is created (depicted in S2 in Figure 1).
Figure 3. Graphical representation of the probability distribution of the compound of interest (COI) observation versus its concentration level. Analytical method parameter extraction is overlaid.
This way, all analytically significant hypothetical CL are contemplated. Once the CLG has been obtained, the process of translating a large data set (consisting of continuous CL of the COI for a significant number of real case observations; depicted in S1 in Figure 1) into an optimal discrete CL can be approached as a clustering problem (depicted in S3 in Figure 1). In this step, all observed continuous CL from the large data set can be assigned to a specific cluster by mathematical means. One of the most efficient ways to do this is K-means clustering. In K-means, K represents the number of clusters to be created using the data, which is equal to the number of initial means (centroids) selected to initiate the clustering process. This clustering method is highly sensitive to the initial centroid values, which results in either a global or a local optimal solution.35 This drawback is not an issue for this work because the CLG obtained from the analytical method parameters can be used to initiate the clustering process, resulting in the global optimal solution. In general, taking into account that the only variable of interest is the CL of the large data set, one-dimensional K-means clustering can be applied (details in the Supporting Information). The centroids of each cluster can then be adopted as the discrete concentration level (DCL) representation of the continuous concentration level (CCL) for the COI. However, the observed CCL data set of the COI applied in the clustering step has to be adopted from real cases and should consist of a sufficient number of samples randomized over all factors that could potentially influence the outcome (biological matrix, location, time, analytical method). Here it should be noted that the clustering step is not only dependent on the initial centroid values assigned but also on the content of the presumed large data set, which might not necessarily be representative enough of all CL normally encountered in laboratory settings. To control for this information leakage, exhaustive bootstrap sampling can be applied.36 Details on the probabilistic bootstrap sampling developed can be found in the Supporting Information. The posterior probability distribution obtained from all bootstrap samples is assumed to be normally distributed; therefore, the mean can be considered the maximum likelihood estimate for P(H0).
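The following is a minimal one-dimensional K-means sketch (Python/NumPy) of steps S1−S3, not the implementation used in this work (which is described in the Supporting Information); the analytical method parameters and the CCL data set are illustrative assumptions.

import numpy as np

# Assumed analytical method parameters (illustrative values, in mg/kg)
MIL, C_star, MDL, DSL = 0.2, 0.05, 0.4, 0.8

# Step S2: concentration level grid (CLG) between 0 and the DSL with C* as the interval
clg = np.arange(0.0, DSL + C_star, C_star)

# Step S1: hypothetical continuous concentration levels (CCL) from a large data set
rng = np.random.default_rng(0)
ccl = np.concatenate([np.zeros(1900), rng.uniform(MIL, DSL, 100)])

# Step S3: one-dimensional K-means initialized with the CLG as centroids
centroids = clg.copy()
for _ in range(100):
    labels = np.argmin(np.abs(ccl[:, None] - centroids[None, :]), axis=1)   # nearest centroid
    new = np.array([ccl[labels == k].mean() if np.any(labels == k) else centroids[k]
                    for k in range(len(centroids))])                        # update nonempty clusters
    if np.allclose(new, centroids):
        break
    centroids = new

# Clusters that attracted no CCL values (initial centroids between 0 and the MIL)
# are dropped here and re-added from the CLG later, as described in the text.
dcl = centroids[np.unique(labels)]
print(dcl)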
Random Draw Approach. To facilitate the discussion, the probability calculation for a single bootstrap sample will be explained. Given a single bootstrap sample, a translation of the CCL into the DCL can be performed by applying the clustering approach discussed in the previous section. Here it should be noted that the centroids (DCL) of the final clusters obtained are a translated representation of the hypothetical population sample from which the large data set is assumed to be drawn. The probability for a random biological sample encountered at a random laboratory to contain the COI can therefore be considered a random draw problem. This way, the probability of the COI being present within a random sample can be viewed as the chance of randomly selecting any CL from the DCL set with a value greater than or equal to the MDL. As long as the randomly drawn CL from the DCL set for a given COI is greater than or equal to the MDL, it can be assumed to be positively identified. This assumption can then be used to calculate P(H0) as follows:

P(H0) ≈ n(T) / (n(T̅) + n(T))    (11)

n(T) = n(DCL ≥ MDL)    (12)

n(T̅) = n(DCL < MDL)    (13)
In eq 11, T refers to the set containing all CL values from the DCL set greater than or equal to the MDL, and T̅ refers to the set containing all CL values less than the MDL. Therefore, n(T) and n(T̅), as indicated in eqs 12 and 13, refer to the numbers of elements in the two sets, respectively. Due to the limitations of the analytical methods considered in differentiating variations in CL anywhere between 0 (the CL referring to no trace of the COI) and the MIL, the large data set applied in the clustering step is generally assumed to contain no such values. As a result, during the iterative clustering process, all initial CL values from the CLG between 0 and the MIL will eventually be rejected. Therefore, to counteract this limitation, for every bootstrap sample, the DCL set obtained after convergence is adjusted to contain all CL values below the MIL from the CLG set.
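Continuing the illustrative sketch above, the random draw estimate of eqs 11−13, including the re-insertion of CLG values below the MIL, could look as follows; all quantities are assumed values, not results from this work.

import numpy as np

# Assumed values continuing the previous sketch (illustrative only, in mg/kg)
MIL, MDL = 0.2, 0.4
clg = np.arange(0.0, 0.85, 0.05)                          # CLG from 0 to the DSL
dcl = np.array([0.0, 0.27, 0.41, 0.55, 0.68, 0.79])       # centroids after clustering (assumed)

dcl_adjusted = np.union1d(dcl, clg[clg < MIL])            # re-add CLG values below the MIL
n_T = np.sum(dcl_adjusted >= MDL)                         # eq 12: draws that would be identified
n_T_bar = np.sum(dcl_adjusted < MDL)                      # eq 13: draws that would be missed
p_H0 = n_T / (n_T + n_T_bar)                              # eq 11
print(p_H0)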
Robust Approach for Estimating P(H0). To make the calculation of P(H0) proposed in eq 11 more robust, the probabilities calculated in the previous steps using eqs 6−9 can be employed as further evidence. The information obtained from replicate samples for a given CL can be used to statistically estimate the variation in the analytical method response caused by intractable factors (e.g., experimental errors, instrumental errors, and sample handling). In this approach, the probability of being observed, P(H2|D), for a given CL in the DCL set can be estimated using a simple linear interpolation of the probability distribution illustrated in Figure 3. If a more accurate value is desired, further experimental analysis with a sufficient number of replicates to determine the actual probability of occurrence (the outcome of identifying the COI when actually present) is an option. Once this information is obtained, instead of calculating P(H0) using eq 11, an informative probability that takes into account the sensitivity and specificity of the analytical method can be calculated. Therefore, given that this step is approached as a random draw problem, P(H0) can also be defined as how probable it is for a random CL of the COI in the DCL set to be randomly drawn and observed by the analytical method, assuming it is truly present. It becomes apparent that this probability calculation involves two independent events. The first event refers to randomly selecting any of the CL in the DCL set, and the second event refers to whether the analytical method in question observes or does not observe the selected CL of the COI. The probabilities for these two events can be formulated as follows:

P(CLp) = 1 / n(DCL)    (14)

P(H2|D, CLp) = [P(D|H2, CLp) × P(H2|CLp)] / ([P(D|H2, CLp) × P(H2|CLp)] + [P(D|H3, CLp) × P(H3|CLp)])    (15)

In eq 14, CLp refers to the event of randomly selecting a given CL from the DCL set, and n(DCL) refers to the number of CL in the DCL set. Taking into account that all events are equally probable, a flat probability distribution is assigned. In eq 15, the probability of the selected CLp being identified by the analytical method in question leads to the same formulation (probability) proposed in eq 8. The combined probability of the two events, taking into account the exhaustive set of mutually exclusive case scenarios supporting H2, leads to the following formulation:
P(H0) ≈ Σ_{c=1}^{C} [P(CLp) × P(H2|D, CLp)]_c    (16)

In eq 16, c refers to a single case from the set of all mutually exclusive case scenarios, and C refers to the maximum number of possible case scenarios, which is equivalent to the number of elements in the DCL set. By definition of eq 2, once P(H0) has been estimated, the calculation of P(H1) in its simplest form is straightforward. However, in the same way that eq 16 incorporated the sensitivity of the analytical methods, if the specificity is to be incorporated in the calculation of P(H1), eq 2 can be ignored, and a direct calculation can be employed. For this, several blank samples can be used to calculate P(H3|D, CL0), where CL0 refers to the CL corresponding to no trace of the COI being present. This reflects the probability of the COI not being mistakenly identified when it is actually absent (referring to specificity). The general formulation to calculate P(H1) can be presented as follows:

P(H1) ≈ Σ_{c=1}^{C} [P(CLp) × P(H3|D, CLp)]_c    (17)

In eq 17, all parameters are as defined previously. If the probabilities for H0 and H1 are calculated independently, as proposed in eqs 16 and 17, both independently computed values have to be normalized as follows:

P(H0) = P(H0) × nf
P(H1) = P(H1) × nf    (18)

nf = 1 / (P(H0) + P(H1))    (19)

In eq 18, nf refers to the normalizing factor defined in eq 19. That way, Bayes’ rule in eq 3 is not violated. Once the final posterior probability has been calculated, in cases where a new analytical method (software, instrument) is developed and introduced, the probability already calculated can be updated by further propagating the posterior probability as a prior using Bayes’ theorem and incorporating the new analytical method into the computation.
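As an illustration of eqs 14−19, the following sketch combines a hypothetical per-CL observation probability (a sensitivity curve such as the one in Figure 3, here simply given as assumed numbers) with a hypothetical blank-sample specificity. Treating P(H3|D, CLp) as the complement of P(H2|D, CLp) for nonzero CL is an interpretive assumption for this sketch, not a prescription from this work.

import numpy as np

# Assumed inputs (illustrative only): the adjusted DCL set, a per-CL observation
# probability P(H2|D, CLp) as in eq 15 (e.g., interpolated from a curve such as
# Figure 3), and a blank-sample value P(H3|D, CL0) reflecting specificity.
dcl = np.array([0.0, 0.05, 0.10, 0.15, 0.45, 0.62, 0.80])            # mg/kg (assumed)
p_observe = np.array([0.01, 0.02, 0.05, 0.10, 0.90, 0.97, 0.99])     # assumed P(H2|D, CLp)
p_blank_not_observed = 0.98                                          # assumed P(H3|D, CL0)

C = dcl.size
p_CLp = np.full(C, 1.0 / C)                  # eq 14: flat probability of drawing each CL

p_H3 = 1.0 - p_observe                       # complement for nonzero CL (interpretive assumption)
p_H3[dcl == 0.0] = p_blank_not_observed      # blank-sample specificity for CL0

p_H0 = np.sum(p_CLp * p_observe)             # eq 16
p_H1 = np.sum(p_CLp * p_H3)                  # eq 17
nf = 1.0 / (p_H0 + p_H1)                     # eq 19
p_H0, p_H1 = p_H0 * nf, p_H1 * nf            # eq 18
print(p_H0, p_H1)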
■ EXPERIMENTAL SECTION

Computation. All functions for the simulation and computation were developed in MATLAB version 8.2.0.701 (R2015a) and were implemented on an Alienware AURORA_R4 desktop computer (Intel(R) Core(TM) i7−3820 CPU, 3.60 GHz, 32 GB RAM, 64-bit Windows 7 Professional operating system).

Data. The data used in Results and Discussion were obtained from The Netherlands Food and Consumer Product Safety Authority (NVWA) Web site, which can be accessed with the links in the references.37−39

■ RESULTS AND DISCUSSION

To demonstrate the proposed method, data obtained from The Netherlands Food and Consumer Product Safety Authority (NVWA)37−39 and a simulated system (to represent existing analytical methods) were used. Because experimental demonstration (spike-in) is beyond the scope of this manuscript, and since the main aim of this work is to formulate a novel theoretical framework for the topic under discussion, real data combined with a simulated analytical method approach was sufficient. Furthermore, two forms of validation are presented: the first validation adopts the data for each pesticide from the NVWA data set as the population sample (ground truth) and applies simulated analytical methods, and the second validation adopts the combined result from all pesticides of the NVWA data set as the population sample (ground truth) for a hypothetical COI and applies simulated analytical methods. This way, in contrast to creating random numbers for validation, realistic CL values are applied. A schematic of the demonstration and validation process can be found in Figure 4. Details on both validation steps can be found in the Supporting Information.

Demonstration of the Method Based on Real Data and a Simulated Analytical Method. The data obtained from NVWA consisted of 4896 screening results of 153 pesticides (compounds used in agricultural settings) from 2151 fruit and vegetable samples between January and July 2014 (Figure 5). The obtained data are assumed to fulfill two criteria: the first is that they give a sufficient overview of the random CL normally observed for the routinely screened pesticides, randomized over factors that could affect the outcome (biological matrix, source of sample, etc.), and the second is that the screening process to obtain the data has made exhaustive use of existing analytical methods.
Figure 4. Graphical representation of method demonstration and validation.

Given these assumptions, for the proposed method, the variable of interest is the observed CL of the pesticides, which reduces the dimensionality of the data to one. Therefore, for demonstrative purposes, the CL of three example pesticides were extracted: 415 values for imazalil, 375 values for thiabendazole, and 243 values for ortho-phenylphenol (Figure 5). Taking into account that there are 2151 samples in the data, the extracted information for each pesticide also consisted of 2151 CL values, where a value of 0 (the CL referring to no trace of the pesticide) was assigned to those samples in which the pesticide in question was not observed. In a simplistic approach, the probability of a random sample containing the COI can be calculated by observing how frequently the COI in question was observed out of the total number of samples. As discussed in Theory, this approach leads to the fundamental question of how many samples are sufficient, as well as the probability that a random case falls outside the range reflected by the data. With the method proposed in this manuscript, these two fundamental questions can be addressed.

To proceed with the demonstration of the method, a thorough experimental analysis (explained in Theory) is needed. A simple simulated system, which applies a random number generator to reflect the analytical method performance boundaries (a value representing the minimum CL required for the analytical method in question to positively identify the COI when truly present), was employed (details in the Supporting Information). Therefore, in addition to the data obtained from NVWA, two instruments, one for each technique (LC-MS/MS and GC-MS/MS), and three software types were assumed to represent the existing analytical methods used at different toxicology laboratories. Using the random number generator, each analytical method (software and instrument) was assigned a value corresponding to its performance boundary (in this case, a value of 0.61575 for software one, 0.82003 for software two, 0.30148 for software three, 0.20725 for the LC-MS instrument, and 0.4115 for the GC-MS instrument). Once these analytical boundaries were assigned, they were adopted for both demonstration and validation purposes.

In order to estimate the necessary analytical method parameters (MIL, MDL, C*, and DSL), the probability distribution of observation versus CL for the COI is necessary. For this step, the formulation described in eqs 6−9 was employed in an iterative process using a simulated numerical grid (representing spiked samples). This process is a representation of the analytical step (as proposed in Theory) necessary to estimate the analytical method parameters using biological samples spiked with known CL of the COI (details on how the iterative process was performed can be found in the Supporting Information). In a similar fashion, as illustrated in Figure 3, values of 0.2074, 0.00046, 0.4119, and 0.8201 mg/kg were extracted and assigned to the MIL, C*, MDL, and DSL, respectively.
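A stand-in for the simulated analytical system described above could be as simple as the following Python sketch; the actual simulation is given in the Supporting Information, and the method labels and the uniform sampling range used here are assumptions.

import numpy as np

# Stand-in for the simulated system: each instrument/software is assigned a random
# performance boundary, i.e., the minimum CL at which it positively identifies the COI.
rng = np.random.default_rng(2015)
methods = ["LC-MS", "GC-MS", "software 1", "software 2", "software 3"]   # assumed labels
boundaries = dict(zip(methods, rng.uniform(0.0, 1.0, len(methods))))

def identifies(method, cl):
    # a given CL is "observed" by a method only at or above that method's boundary
    return cl >= boundaries[method]

print(boundaries)
print(identifies("LC-MS", 0.5))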
Figure 5. Graphical representation of 4896 screening results of 153 pesticides from 2151 fruit and vegetable samples between January and July 2014, adopted from The Netherlands Food and Consumer Product Safety Authority (NVWA).
Figure 6. (A−C) Graphical representation of the classification (clustering) result of the continuous concentration level (CCL) versus the discretized concentration level (DCL) of the example bootstrap samples for imazalil, thiabendazole, and ortho-phenylphenol, respectively. (D−F) Histogram representation of the normally distributed posterior probability obtained by exhaustive bootstrap sampling for imazalil, thiabendazole, and ortho-phenylphenol, respectively.
These parameters were then used to create a CLG between 0 and the DSL, adopting C* as the interval. The CLG was then used to translate the observed CCL into the DCL for all three example pesticides. To perform the discretization step efficiently, the clustering approach described in Theory was employed. Figure 6A−C illustrates the translation of the CCL into the DCL taking a few bootstrap samples as an example; however, exhaustive probabilistic bootstrap sampling was performed. After the DCL for each bootstrap sample was acquired, P(H0) was calculated with the simplest approach using eq 11. The collection of probabilities, each obtained using a single bootstrap sample, resulted in a posterior probability distribution. Given this probability distribution, assumed to be normally distributed (depicted in Figure 6D−F), the mean was accepted as the final maximum likelihood estimate for P(H0). Therefore, based on the real samples obtained from NVWA and the simulated analytical methods, the final P(H0), reflecting how probable it is for a random sample to contain imazalil, thiabendazole, or ortho-phenylphenol, was estimated to be 0.0131, 0.0253, and 0.0152, respectively. The obtained result is based on the simulated analytical system used to demonstrate the method and is not a reflection of the true probability in reality. To obtain the precise probability with the method proposed in this manuscript, the experimental analysis explained in Theory should be meticulously followed. Validation of the proposed method can be found in the Supporting Information.
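For illustration, a plain (nonprobabilistic) bootstrap version of the workflow described above could look as follows; the probabilistic bootstrap actually used, as well as the real NVWA concentration levels, are described in the Supporting Information and are replaced here by the assumed quantities from the earlier sketches.

import numpy as np

# Assumed quantities from the earlier sketches (illustrative, in mg/kg)
MIL, C_star, MDL, DSL = 0.2, 0.05, 0.4, 0.8
clg = np.arange(0.0, DSL + C_star, C_star)
rng = np.random.default_rng(1)
ccl = np.concatenate([np.zeros(1900), rng.uniform(MIL, DSL, 100)])

def kmeans_1d(values, init_centroids, n_iter=100):
    # one-dimensional K-means initialized with the CLG; empty clusters are dropped
    centroids = init_centroids.copy()
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        new = np.array([values[labels == k].mean() if np.any(labels == k) else centroids[k]
                        for k in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.unique(labels)]

p_H0_samples = []
for _ in range(200):                                      # "exhaustive" sampling in the paper; 200 here
    boot = rng.choice(ccl, size=ccl.size, replace=True)   # plain bootstrap resample of the CCL data
    dcl = kmeans_1d(boot, clg)
    dcl = np.union1d(dcl, clg[clg < MIL])                 # re-add CLG values below the MIL
    p_H0_samples.append(np.mean(dcl >= MDL))              # eq 11 for this bootstrap sample
print(np.mean(p_H0_samples))                              # mean of the bootstrap distribution, taken as P(H0)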
■ CONCLUSION

Currently, Bayesian statistics is being applied in almost all fields of science. The Bayesian approach provides an optimal, elegant, and rational approach for answering scientific questions for a given state of information. Its capability to combine prior information with data using Bayes’ theorem within a solid decision framework and to reach a more optimal posterior probability is one of the great advantages of Bayesian statistics. Thus, when new observations become available, the previous posterior distribution can be further propagated as a prior to the next step.

In toxicology screening (either in forensics or food safety), due to the uncertainties involved in the analytical process, the application of probabilistic models is becoming more prevalent. In this respect, Bayesian statistics has made a significant contribution to the field. If Bayesian statistics is applied in toxicology screening, confidently identifying and confirming a COI in a given random sample requires a prior belief about how probable it is to find the COI within the sample. In contrast to earlier suggestions that make use of the available literature or the subjective knowledge of the analyst, a novel theoretical framework to calculate a nonsubjective prior, which takes into account the sensitivity and specificity of the analytical methods and incorporates the evidential and physical forms of the available data, has been proposed, demonstrated, and validated for application.

■ ASSOCIATED CONTENT

Supporting Information. The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.5b02916: simulation of analytical method performance boundaries, grid representation of spike-in biological samples, estimation of the COI observation probability distribution, extraction of analytical method parameters from the simulated system, K-means clustering, probabilistic bootstrap sampling, validation one, and validation two (PDF).
■ AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected]. Tel.: +31 20 5257040.

Notes
The authors declare no competing financial interest.
■ ACKNOWLEDGMENTS

This work was funded by private partners (DSM Resolve, RIKILT, and NFI) and NWO and was supported by the COAST Project (Project No. 053.21.105).
■ REFERENCES
(1) Armstrong, N.; Hibbert, D. B. Chemom. Intell. Lab. Syst. 2009, 97, 194−210.
(2) Ashby, D. Stat. Med. 2006, 25, 3589−631.
(3) Chen, H.; Bakshi, B. R.; Goel, P. K. Anal. Chim. Acta 2007, 602, 1−16.
(4) Hobbs, B. P.; Carlin, B. P. J. Biopharm. Stat. 2007, 18, 54−80.
(5) Skrepnek, G. H. PharmacoEconomics 2007, 25, 649−64.
(6) Cox, R. T. Am. J. Phys. 1946, 14, 1−13.
(7) D'Agostini, G. Rep. Prog. Phys. 2003, 66, 1383−1419.
(8) Armstrong, N.; Hibbert, D. B. Chemom. Intell. Lab. Syst. 2009, 97, 211−220.
(9) Greenland, S.; Poole, C. Epidemiology 2013, 24, 62−8.
(10) de Elia, R.; Laprise, R. Mon. Weather Rev. 2005, 133, 1129−1143.
(11) Levine, B. Anal. Chem. 1993, 65, 272A−276A.
(12) Herrera-Lopez, S.; Hernando, M. D.; García-Calvo, E.; Fernández-Alba, A. R.; Ulaszewska, M. M. J. Mass Spectrom. 2014, 49, 878−93.
(13) Blair, I. A. Chem. Res. Toxicol. 1993, 6, 741−747.
(14) Raisys, V. J. Chem. Educ. 1985, 62, 1050.
(15) Van Bocxlaer, J. F.; Clauwaert, K. M.; Lambert, W. E.; Deforce, D. L.; Van den Eeckhout, E. G.; De Leenheer, A. P. Mass Spectrom. Rev. 2000, 19, 165−214.
(16) Maurer, H. H. Clin. Chem. Lab. Med. 2004, 42, 1310−24.
(17) Hoja, H.; Marquet, P.; Verneuil, B.; Lotfi, H.; Penicaut, B.; Lachatre, G. J. Anal. Toxicol. 1997, 21, 116−26.
(18) Rosano, T. G.; Wood, M.; Swift, T. A. J. Anal. Toxicol. 2011, 35, 411−23.
(19) Berendsen, B. J.; Stolker, L. A.; Nielen, M. W. J. Am. Soc. Mass Spectrom. 2013, 24, 154−63.
(20) Bethem, R.; Boison, J.; Gale, J.; Heller, D.; Lehotay, S.; Loo, J.; Musser, S.; Price, P.; Stein, S. J. Am. Soc. Mass Spectrom. 2003, 14, 528−541.
(21) Heller, D. N.; Lehotay, S. J.; Martos, P. A.; Hammack, W.; Fernández-Alba, A. R. J. AOAC Int. 2010, 93, 1625−1632.
(22) De Zeeuw, R. A. J. Chromatogr. B: Anal. Technol. Biomed. Life Sci. 2004, 811, 3−12.
(23) Pesyna, G. M.; McLafferty, F. W.; Venkataraghavan, R.; Dayringer, H. E. Anal. Chem. 1975, 47, 1161−1164.
(24) Pesyna, G. M.; Venkataraghavan, R.; Dayringer, H. E.; McLafferty, F. W. Anal. Chem. 1976, 48, 1362−1368.
(25) Woldegebriel, M.; Vivó-Truyols, G. Anal. Chem. 2015, 87, 7345−55.
(26) Vivó-Truyols, G. Anal. Chem. 2012, 84, 2622−2630.
(27) Martin, S. K.; Farnan, J. M. J. Grad. Med. Educ. 2013, 5, 159−160.
(28) Morita, S.; Thall, P. F.; Muller, P. Stat. Biosci. 2010, 2, 1−17.
(29) Feller, W. An Introduction to Probability Theory and Its Applications, 3rd ed.; Wiley: New York, 1968.
(30) Milman, B. L. Anal. Chem. 2002, 74, 1484−92.
(31) Milman, B. L. J. Chem. Inf. Model. 2005, 45, 1153−8.
(32) Milman, B. L.; Konopelko, L. A. Fresenius' J. Anal. Chem. 2000, 367, 621−628.
(33) Milman, B. L. TrAC, Trends Anal. Chem. 2005, 24, 6.
(34) Long, G. L.; Winefordner, J. D. Anal. Chem. 1983, 55, 713A−724A.
(35) Babu, G. P.; Murty, M. N. Pattern Recogn. Lett. 1993, 14, 763−769.
(36) Wehrens, R.; Putter, H.; Buydens, L. M. Chemom. Intell. Lab. Syst. 2000, 54, 35−52.
(37) https://www.nvwa.nl/txmpub/files/?p_file_id=2207593 (accessed May 3, 2015).
(38) https://www.nvwa.nl/txmpub/files/?p_file_id=2207592 (accessed May 3, 2015).
(39) https://www.nvwa.nl/txmpub/files/?p_file_id=2207591 (accessed May 3, 2015).