Estimation of Complete Discrete Multivariate Probability Distributions

Apr 15, 2014 - ABSTRACT: This paper presents a method of estimating discrete multivariate probability distributions from scarce historical data. Of pa...
0 downloads 0 Views 2MB Size
Article pubs.acs.org/IECR

Estimation of Complete Discrete Multivariate Probability Distributions from Scarce Data with Application to Risk Assessment and Fault Detection Taha Mohseni Ahooyi,† Jeffrey E. Arbogast,‡ Ulku G. Oktem,∥ Warren D. Seider,§ and Masoud Soroush*,† †

Department of Chemical and Biological Engineering, Drexel University, Philadelphia, Pennsylvania 19104, United States American Air Liquide, Newark, Delaware 19702, United States § Department of Chemical and Biomolecular Engineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, United States ∥ Risk Management and Decision Processes Center, Wharton School, Philadelphia, Pennsylvania 19104, United States ‡

ABSTRACT: This paper presents a method of estimating discrete multivariate probability distributions from scarce historical data. Of particular interest is the estimation of the probabilities of rare events. The method is based on maximizing the information entropy subject to equality constraints on the moments of the estimated probability distributions. Two criteria are proposed for optimal selections of the moment functions. The method models nonlinear and nonmonotonic relations with an optimal level of model complexity. Not only does it allow for the estimation of the probabilities of rare events, but, together with Bayesian networks, it also provides a framework to model fault propagation in complex highly interactive systems. An application of this work is in risk assessment and fault detection using Bayesian networks, especially when an accurate first-principles model is not available. The performance of the method is shown through an example.

1. INTRODUCTION Risk assessment usually refers to a set of analyses that identify potential hazards and evaluate possible consequences of the hazards when they occur.1 It involves estimating (a) the likelihoods of different possible risky situation scenarios and (b) the costs associated with the risks. More specifically, risk assessment includes simultaneous development of realistic fault scenarios, estimation of failure cost, and quantification of risk probabilities. Existing risk assessment and fault detection schemes primarily involve abnormal situations with high probabilities and moderate costs,2 whereas a major fraction of catastrophic and large-scale incidents with highly destructive consequences have been caused by some triggering events whose probabilities had been found infinitesimal by risk assessment. These abnormal events are usually referred to as “rare events”,3 which are of two major types: (a) those so rare and far-fetched that their probabilities may be considered to be practically zero, and (b) those actually predictable but showing a minor recurrence frequency, compared to the expected lifetime of the plant. An example of the first type is industrial plant destruction due to a meteor hitting the plant site exactly, and an example of a rare event of the second type is a control system failure. Throughout this paper, the term “rare events” is used for those of the second type. Although the probabilities of rare events may be predictable, often reducing the negative impacts of their consequences, modern industrial establishments are still suffering from the resulting catastrophes for two major reasons. First, the probabilities of rare events are usually underestimated intentionally or inadvertently, eventually leading to false assurances that associated risks also are negligible. Consequently, their © 2014 American Chemical Society

associated risks are not taken seriously, with rudimentary precautionary schemes used to mitigate them. Second, probability estimates of events that have seldom or never happened, or are rarely observed or recorded during plant operations, carry a great deal of uncertainty. Because of the high estimation uncertainty, the estimates are hardly useful in practice. The uncertainty is mainly due to the lack of a general method of integrating the rarity and “extrapolation into the future” of the system behavior. Hence, the reliable estimation of probabilities that are unknown, infinitesimally small, and hard to predict, is an open problem. This estimation problem becomes more challenging and interesting when of interest are complex failure scenarios in which the objective is not to determine the probability of a plant component failure but to calculate the probabilities of a chain of subsequent failures. Rare-event probability estimation has been studied in the past decade. In cases where an accurate plant model is available, efforts have been made to generate data samples using the model. Methods such as Markov-Chain Monte Carlo,4 importance sampling,5 and splitting6 have been employed extensively to calculate probability distributions to identify abnormal situations. However, when a reliable model, especially one accounting for uncertainties, is not available, sampling methods fail to provide a complete representation of the system’s behavior. For such cases, probability estimation Special Issue: John Congalidis Memorial Received: Revised: Accepted: Published: 7538

December 16, 2013 April 12, 2014 April 15, 2014 April 15, 2014 dx.doi.org/10.1021/ie404232v | Ind. Eng. Chem. Res. 2014, 53, 7538−7547

Industrial & Engineering Chemistry Research

Article

predictability) about the system grows with time from the last available observation. The information entropy of a discrete random variable X, denoted by S(X), is defined as25

methods have been developed to estimate probabilities of random events from data. These methods are either nonparametric or parametric.7,8 The most basic nonparametric method is the histogram method.7 Parametric methods9 consider a parametric family, such as copula densities10 and moment-based probabilities,11 to represent the observed data. Other complex probabilistic structures have also been proposed to maximize the flexibility of the estimated models. While existing probability density estimation techniques have the advantages noted, they suffer from high computational cost (as in the nonparametric kernel method12) or are unable to describe all dependence structures (as in the conventional copula methods).13 Once complete probability distributions are estimated, (a) risk can be predicted and (b) inferences are made to perform fault detection and identification using Bayesian networks.14 Alternative fault detection and identification methods use Kalman filtering,15 principal component analysis,16 fault and event trees,17 fuzzy-logic-based modeling,18 artificial neural networks19 or other concepts.20 In this paper, we propose a method of estimating discrete multivariate probability distributions from scarce historical data. Of particular interest is the estimation of the probabilities of events not observed before (i.e., rare events). The method is based on a constrained maximization of the information entropy. It uses sample moments, which contain information encoded in the data. The resulting probability distribution models will then be used to develop Bayesian networks and conduct risk analysis and fault detection in the rare-event regime. Unlike traditional approaches that use relative frequency to estimate probabilities, this method incorporates all information in datasets to estimate complete discrete multivariate probability distributions covering all regions including unobserved ones. These probability distributions permit the calculation of unknown or near-zero probabilities much faster than sampling from first-principles models. Furthermore, nonlinear and nonmonotonic relations are modeled with an optimal level of complexity. Also, the use of the proposed method, together with Bayesian networks, allows calculation of the probabilities of multilevel risk scenarios involved with interactive plant components. This work is an extension of our previous work21 on estimating probability density functions of continuous random variables. Two approaches to finding the optimal complexity level of the estimated probability distributions are also presented. The rest of the paper is organized as follows. Section 2 describes the constrained maximum-entropy (ME) probability estimation method for discrete random variables and the criteria-based moment function selection. Section 3 presents an application of the method to an example Bayesian network. Finally, conclusions are presented in section 4.

N

S(X ) = −∑ P(X = xi) ln P(X = xi)

(1)

i=1

where P(X = xi) is the probability of X = xi and N is the number of the states of X (discrete values that X can take). In information theory, maximization of the information entropy is frequently used to ensure that the prior artificial assumptions included in knowledge-based systems are minimized.26 This procedure leads to less biased models. When mass probability distributions (PMFs) are estimated by maximizing the information entropy without imposing constraints on the shape of the distributions, then uniform distributions, in which all states have equal probabilities of occurrence, are obtained. In such a system, the outcome of the random process is absolutely uncertain; that is, no outcome is more likely than the others. However, for every process, there is usually some information, no matter how uncertain, that can be used to impose constraints on the entropy maximization to obtain a more specific and informative probability distribution. 2.2. Moments of a Probability Distribution. Given the probability distribution of a discrete random variable X, its theoretical moment with respect to a moment function, gk(X), is N

μk (X ) =

∑ gk(X = xi)P(X = x i)

k = 0 , ..., q (2)

i=1

For example, when gk(X) = X, N

μk (X ) =

∑ xiP(X = xi) = E(X )

(3)

i=1

where E(X) is the expectation or mean of X. When gk(X) = (X − E(X))2, then μk = E((X − E(X))2) represents the degree of diffuseness of the PMF or variance, and when gk(X) = ((X − μ)/σ)3, E([(X − μ)/σ]3) provides information on the skewness or asymmetry of the distribution, where μ and σ represent the mean and standard deviation of the distribution, respectively. The definitions of the moments can be extended to the multivariate PMFs, which are of particular interest because of their capabilities of modeling joint probabilities. Estimations of joint PMFs lead to complete conditional probabilities, which are required to train Bayesian networks. For multivariate PMFs, the moments of a vector of random variables, X⃗ = T d×1 [ X1 ··· Xd ] ∈ Λ ⊂  , are N

μ k (X ⃗ ) =

∑ gk(X⃗ = xi⃗ )P(X⃗ = xi⃗ ) i=1

2. METHOD 2.1. Entropy Maximization. To estimate the probabilities of unobserved events from the probabilities of observed ones, we combine two concepts widely used in statistical learning22 to estimate complete probability distributions. The first concept is the information entropy introduced by Claude Shannon23 as an informatics equivalent of the thermodynamic entropy, which represents disorder. The maximum entropy principle24 states that every system loses its information content gradually, whether intrinsically (similar to that seen in nature) or observationally, where the degree of uncertainty (lack of

k = 0 , ..., q (4)

where gk(X⃗ ): Λ →  is a moment function, and P(X⃗ ): Λ →  is a joint PMF. In this multivariate case, the information entropy is N

S(X⃗ ) = −∑ P(X⃗ = xi⃗ ) ln P(X⃗ = xi⃗ ) i=1

Given the PMF of a discrete random vector X⃗ , using eq 4, moments can be calculated. The empirical (sample) moment (μ̅k) of the corresponding sampled population is given by 7539

dx.doi.org/10.1021/ie404232v | Ind. Eng. Chem. Res. 2014, 53, 7538−7547

Industrial & Engineering Chemistry Research

Article

Table 1. Several Exponential Probability Distribution Functions and Their Moments distribution

moment

density function

Exponential

∫ xf (x) = μ

1 μ

f (x) =

exp

( ) −x μ

Gaussian

∫ (x − μ)2 f (x) = σ 2

∫ xf (x) = μ ,

1 σ 2π

f (x) =

exp

(

−(x − μ)2 2σ 2

)

Beta Γ ′ (a) Γ ′ (a + b) − Γ(a + b) , Γ(a) Γ ′ (b) Γ ′ (a + b) f (x) = Γ(b) − Γ(a + b)

∫ ln xf (x) =

(

x a − 1(1 − x)b − 1 B(a , b)

f (x) =

)

∫ ln 1 − x Gamma

Γ ′ (a) Γ(a)

∫ xf (x) = a ,

∫ ln xf (x) =

∫ x af (x) = 1,

∫ ln xf (x) = −( ba )

f (x) =

x a − 1 exp(−x) Γ(a)

Weibull

μk̅ =

1 n

n

∑ gk(χj⃗ )

(5)

where χ⃗j ∈ d × 1 represents the jth data sample, and n is the number of the data samples. The sample moments carry information about the underlying distributions of data; therefore, they are used to estimate the probability distributions that govern the samples. Seeking a PMF solely based on the sample moments usually yields the maximum-likelihood PMF.27 As more moments are included in the form of constraints, more information from the samples is included in the estimated PMF. 2.3. Constrained Entropy Maximization. To constrain the maximization of the information entropy, a PMF, whose moments are the same sample moments, is sought. In other words, given a set of moment functions, we seek optimal model probabilities, P̂̂ 1, ...,P̂̂ N, which are the solution to the following constrained optimization problem: ⎫ ⎧N (P1̂ , ..., PN̂ ) = arg max S(X⃗ ) = arg min⎨∑ Pi ln[Pi ]⎬ ⎭ P1 , ..., PN P1 , ..., PN ⎩ i = 1 ⎪







x b

x a b

( ( ))

exp −

Ω1 = {x1 , ..., xd}, Ω 2 = {xixj|i , j = 1 , ..., d}, Ω3 = {xixjxk|i , j , k = 1 , ..., d}, Ω4 = {xixjxkxl|i , j , k , l = 1 , ..., d} ⋮

(6)

(9)

Thus, eq 7 becomes N

N i=1

a b

Ω 0 = {1},

subject to

∑ gk(X⃗ = xi⃗ )Pi ̂ = μk̅

a−1

( )( )

that can completely characterize several widely used continuous distribution functions. For example, a Gaussian (normal) distribution is fully characterized by its mean and variance. However, since the actual distribution that has given rise to the observed data is generally unknown, a decomposition technique is used here to approximate the actual moment functions underlying the observed samples. To simplify the search for the appropriate moment functions, it is proposed here to search for the moment functions gk (x ⃗) ∈ ∪O i = 0 Ωi , where O is a positive integer to be selected by the user, and

k = 0 , ..., q

j=1

f (x) =

∑ gk(X⃗ = xi⃗ )P(̂ X⃗ = xi⃗ ) =

k = 0 , ..., q

i=1

(7)

∀ g k (x ⃗ ) ∈

where Pi ̂ = P(̂ X⃗ = xi⃗ ) ∈ [0, 1] and g0(x ⃗) = 1. Therefore, given the moment functions, the constrained optimization problem of eqs 6 and 7 is a conventional nonlinear program, which is easy to solve numerically, using a global optimization technique.28 2.4. Selection of the Moments. To estimate PMFs reliably, it is essential to select appropriate moment functions. These moment functions provide the PMFs with sufficient cumulative information extracted from data. Not only can the moment function selection significantly improve the accuracy of the estimation, but it also provides a means to control the computational complexity of the optimization step with minimum information loss due to coarse discretization of the distribution functions. For example, estimation of the Gaussian distribution requires only the first moment (mean) and the second central moment (variance). The other moments do not provide much additional knowledge from the data to arrive at a different estimated distribution. Table 1 lists moment functions

∪O i=0

Ωi

1 n

N

∑ gk(χi⃗ ), i=1

(10)

which indicates that the user just needs to select an appropriate value for the positive integer O, instead of selecting appropriate moment functions needed in the original formulation of eq 7. The use of a higher O would appear to provide higher accuracy. However, because (a) the use of a higher O imposes higher computational costs, and (b) more-complex moment functions often fail to predict the actual probability behavior outside the range of the data (due to overfitting),29 one should select an adequately large O, as suggested by Occam’s razor principle.30 Hence, there is a tradeoff between information loss in lower-order moment functions and high variance of higherorder ones, particularly when predictions of probabilities outside the observed regions are intended. Consequently, the lowest level of complexity (O), which satisfies an error tolerance threshold, should be chosen. To find optimal values of O systematically, we propose two criteria: the Maximum 7540

dx.doi.org/10.1021/ie404232v | Ind. Eng. Chem. Res. 2014, 53, 7538−7547

Industrial & Engineering Chemistry Research

Article

Table 2. Actual Probability Distributions and Probability Distributions Based on 100 Randomly Selected Samples variable X

Y

Z

U

W

x actual P(x) sampled P(x) y actual P(y) sampled P(y) z actual P(z) sampled P(z) u actual P(u) sampled P(u) w actual P(w) sampled P(w)

State S1

State S2

State S3

State S4

State S5

State S6

State S7

0.50 0.010 0.010 −2.50 0.001 0.000 0.48 0.003 0.010 −1.04 0.000 0.000 −1.87 0.000 0.010

1.50 0.150 0.170 −1.50 0.050 0.110 0.85 0.011 0.000 −0.71 0.001 0.000 −1.12 0.042 0.090

2.50 0.700 0.730 −0.50 0.110 0.090 1.21 0.163 0.210 −0.38 0.007 0.010 −0.36 0.157 0.150

3.50 0.080 0.050 0.50 0.550 0.500 1.58 0.670 0.680 −0.06 0.024 0.040 0.40 0.556 0.510

4.50 0.005 0.000 1.50 0.190 0.180 1.95 0.096 0.060 0.27 0.383 0.450 1.16 0.193 0.190

5.50 0.050 0.040 2.50 0.090 0.120 2.31 0.050 0.040 0.60 0.529 0.460 1.92 0.050 0.050

6.50 0.005 0.000 3.50 0.009 0.000 2.67 0.007 0.000 0.92 0.056 0.040 2.67 0.002 0.000

Likelihood Estimate (MLE) of O and the Maximum a Posteriori Estimate (MAPE) of O. 2.4.1. Maximum Likelihood Estimate (MLE) of O. The likelihood of a parameter O, given a database D, is simply defined as the conditional probability of D given O: L(O|D) = P(D|O)

value of O that maximizes this (posterior) probability. Using Bayesian model averaging:32 Omax

P(X = xi|D) =

(11)

P(X |D) ≃ P(X |D, OMAP)

O

(15)

where OMAP = arg max P(O|D)

(12)

O

where P̂i,o denotes the model prediction of the probability of state i using the moment functions up to order O and ni represents the number of data points in the ith state. Note that ∑Ni=1ni = n. Similar to the behavior observed in bias-variance tradeoff,31 as the complexity level of a model increases, better agreement with the data is achieved and the likelihood function increases. These trends continue to a certain complexity level beyond which they reverse; with decreased likelihood of the data (a measure of the accuracy of the model), while the computational cost increases. The value of O that yields the closest agreement is called the maximum likelihood estimate (MLE) of the parameter O: OMLE = arg max P(D|O)

(14)

where Omax stands for the maximum value of O that the user should select. Equation 14 allows one to average over different complexity levels to derive a distribution for P(X|D). However, it is often impossible to calculate this sum. As a general solution, P(X|D) is approximated by

where L(·|·) and P(·|·) are the likelihood function and conditional probability, respectively. Therefore, to calculate the likelihood function, conditional probabilities must be available. Such a definition is the basis of the maximum likelihood estimation.27 The likelihood function indicates how well the observed data samples are described using the parameter O. In the case of a polynomial distribution, ̂ ni L(O|D) = P(D|O) = ΠiN= 1Pi ,o

∑ P(X = xi|D, O)P(O|D) O=1

= arg max O

P(D|O)P(O) P(D)

= arg max P(D|O)P(O) O

(16)

The parameter O is also known as the complexity controlling parameter. P(O) denotes the prior probability of O. Since the likelihood function, P(D|O), as stated by eq 12, generally does not have a closed form, setting up a conjugate prior for the likelihood function is not always possible. However, an informative priorfor example, a discrete normal distributioncan still be assigned. As more information is incorporated into this function through the likelihood term, the updated belief about O approaches its actual value. When the mode of this posterior is used as the point estimate of O, this estimation is called the maximum a posteriori estimation. Equation 16 implies that when a uniform distribution is used as the prior probability, equal MLE and MAPE estimates of O are obtained. In the next section, we apply these concepts and algorithms to an example Bayesian network.

(13)

which agrees with Occam’s razor principle. However, since the maximum of the likelihood function is at a high O value and the likelihood function does not change significantly within a wide range of low O values, the user may select a low O value that satisfies a low value of the goodness-of-fit criterion while keeping the computations tractable. 2.4.2. Maximum a Posteriori Estimate (MAPE) of O. The Bayes rule can be used to relate the likelihood and the prior probability of O, leading to the development of a framework that can incrementally update the probability of O. Unlike the MLE, which is a single value of O, the Bayesian approach comes up with a probability distribution of O. MAPE is the

3. APPLICATION TO AN EXAMPLE To demonstrate the performance of the PMF estimation method, we apply the method to a plant with five variables:

7541

X ∼ PMF1

(Table 2)

(17)

Y ∼ PMF2

(Table 2)

(18)

dx.doi.org/10.1021/ie404232v | Ind. Eng. Chem. Res. 2014, 53, 7538−7547

Industrial & Engineering Chemistry Research

Article

Z = X1/2 + Normal(0, 0.1)

(19)

U = log(Z) + Normal(0, 0.03)

(20)

YZ W= + Normal(0, 0.2) Z+1

2b are used to generate 10 000 000 samples of X and 10 000 000 samples of Y, using xi = FX−1(τxi),

yi = FY−1(τy) i

i = 1 , ..., 10 000 000 (22)

(21)

−1

FY−1

FX and are obtained by summing the probabilities given in the PMFs of X and Y, respectively, shown in Table 2. Fourth, the generated samples of Z, U, and W are distributed among the seven states of each of the three variables, as shown by Table 2. To simulate a scarce-data scenario, we generate only 100 random samples of each of the five variables X, Y, Z, U, and W using the same approach employed above to generate the 10 000 000 samples. The resulting 100-sample distributions of the five variables are shown in Table 2. Our probability estimation will be based on these 100-sample distributions. As can be seen, the observed probabilities of some states of each variable are absolutely zero, which is in accordance with Chebyshev’s inequality.33 In detecting faults using a Bayesian network, zero probability of a state can be problematic; in Bayesian inference, the probabilities of all states, including rare states, should be known. For example, to perform Bayesian (backward or forward) inference, given evidence in a rare state, conditional probabilities of other variables, given the rare-state evidence, should be known. 3.1. Maximum Entropy (ME) Estimation of Probabilities. Given the incomplete probabilities of the five variables shown in Table 2, we use the ME method presented in the previous section to estimate complete probabilities of all variables. For an efficient estimation, we initially need to apply the ME method with different values of O. The MLE or MAPE of O is then used to estimate the probabilities of the variables. A nonconjugate prior normal distribution P(O) with a mean of 2 is employed to calculate the MAPE. The mean of 2 can be considered as our prior assumption that second-order and lower moment functions can approximate many elliptical distributions well. This prior assumption is then updated when additional information from the data is incorporated in the maximum a posteriori estimation of O. Tables 3 and 4 compare the MLE and MAPE of O for X and Y. Because of the use of a mean of 2 for P(O), OMAPE is smaller than OMLE. As a consequence of multimodality of the actual PMF of X, higher-order moment functions (complexity) must be used to model X compared to Y. Table 5 compares X and Y

where Normal(μ, σ) represents a normal distribution with a mean of μ and a standard deviation of σ. PMF1 and PMF2 are given in Table 2. The three white noise parameters given in eqs 19−21 represent internal plant uncertainties. In addition to having the uncertainties, the plant is nonlinear. Figure 1 shows a Bayesian network representation of the plant example; it shows the causal relationships among the variables, directly extracted from eqs 19−21.

Figure 1. Bayesian network of the example.

Given the actual PMFs of X and Y by Table 2, the actual PMFs of Z, U, and W are generated by (a) taking 10 000 000 samples from the actual PMFs of X and Y and the probability distributions of the three white noise parameters and (b) using eqs 19−21. To take the 10 000 000 samples from the actual PMFs of X and Y, the following approach is used. First, two random numbers, τx and τy, each with a uniform distribution on the interval [0,1], are defined. Second, the random variable τx is sampled 10 000 000 times to generate random numbers τx1, ..., τx10 000 000. This same approach is used to generate random numbers τy1, ..., τy10 000 000 Third, the inverses of the cumulative probability distributions of X and Y described by Figures 2a and

Figure 2. Generation of random numbers from cumulative probability distributions of X and Y. 7542

dx.doi.org/10.1021/ie404232v | Ind. Eng. Chem. Res. 2014, 53, 7538−7547

Industrial & Engineering Chemistry Research

Article

Table 3. Likelihood of O for X and Y X

distribution of each estimated state of each variable is then calculated. The calculated means and standard deviations are listed in Table 6. Each distribution indicates how reliable the estimated probability of the state is. Assuming the estimator is unbiased; that is, the mean of the distribution is equal to the actual value of the state probability, a narrower distribution (with smaller standard deviation) indicates that the mean value of the distribution is more likely to be the actual value of the estimated state probability. A narrow distribution with a small sample size can be a sign of the consistency of the estimation method. It should be noted that, in addition to studying the behavior of the estimated parameters for different input training data, the outlined approach can be used to avoid uncertainty, because of the randomness of the small datasets, i.e., while two different small datasets may give rise to noticeable discrepancy between the results, this resampling technique can present more reliable estimation of the parameters by aggregation. Therefore, in the applications where a single value variable is required (as in this example), the mean of the distribution produced by the resampling method can be used as an alternative. More details can be found in ref 35. 3.3. Bayesian Network Risk Analysis. Once the complete marginal and conditional probabilities are estimated using the ME method described in section 2, the Bayesian network can be used to conduct inference based on the Bayes rule.32 Generally, there are two types of Bayesian inference: forward and backward, depending on where evidence is introduced to the network. Predictive (Forward). In this case, the evidence has information on the parent node(s)/variables, and the probabilities of the child variables are updated given the state of their parent(s). This type of inference allows one to develop abnormal event propagation scenarios, find the most probable abnormal consequences encountered within a system, and calculate complex chain failure probabilities through risk assessment procedures. Figure 4b shows the updated network once evidence indicating that X is in state S1, is introduced. Such evidence together with conditional probabilities and the Bayesian network updating such rules36 permits updating the probability of every state of every other variable/nodes, including those that have not shown up in historical data. One then can calculate and evaluate the deviation of the posterior probability of each state from the prior probability of the state (normal operation value of the state), determining variables that show more potential to be at a dangerous abnormal state. In this case, once the evidence X being in its lowest state is given to the network, the network shows a large deviation in its most closely connected child node, Z, and a weaker deviation in U. However, these deviations do not affect W significantly, as confirmed by eq 21. Diagnostic (Backward). In this case, the evidence has information about a child node(s). Since such evidence is introduced to the network and the probabilities of the states of

Y

O

Likelihood of O

O

Likelihood of O

27

0.0000 0.0011 0.0233 0.0472 0.0480 0.0481 0.0482 0.0495 0.0588 0.0032 0.0003 0.0000

48

0.0000 0.0003 0.0238 0.0240 0.0242 0.0254 0.0129 0.0003 0.0000

PMFs estimated with MLE and MAPE of O, with the actual X and Y PMFs. An important point to note here is that likelihood functions stay constant for a wide range of O. This suggests that, for the cases where computational tractability of the algorithm is the limiting factor, the lowest O value that satisfies a likelihood threshold can be used to model the PMF. A similar approach is utilized to estimate the joint PMFs. The Bayes rule is then used to calculate conditional probabilities; for example, P(U Z = z) =

P(Z = z , U ) P(Z = z)

Figure 3 shows a comparison of the actual and ME-estimated conditional probabilities of W for Y and Z in their lowest states (Figure 3a), and Y and Z in their highest states (Figure 3b). As this figure shows, the estimated probabilities are more diffuse than the actual distributions. This is mainly due to the estimation method maximizing the information entropy, leading to higher uncertainties in the estimated distributions. Given that, in most cases, the rare states correspond to abnormal situations, the estimated complete distributions can be used to calculate the probabilities of fault propagation scenarios. For example, given independent nodes Y and Z in their highest states, the probabilities of Y, Z, and W being in their highest state simultaneously can be calculated. Figure 4a shows the corresponding Bayesian network drawn using the Netica34 software. We also use this software for Bayesian analysis. 3.2. Confidence Intervals for the Estimated Probabilities. To calculate confidence intervals for the estimated PMFs, we use a resampling method called the Jacknife, or the leave-one-out procedure (described in ref 35). In this method, a distribution for the estimated probability of each state of each variable is calculated by performing an optimization procedure multiple times: each time a different sample set and a state estimate is obtained by systematically leaving one of the 100 sample points out. The mean and standard deviation of the Table 4. Prior and Posterior Probabilities of O for X and Y O=0

O=1

O=2

O=3

O=4

O=5

O=6

O of X

variable prior probability posterior probability

0.000 0.000

0.072 0.000

0.855 0.000

0.072 0.018

0.000 0.982

0.000 0.000

0.000 0.000

O of Y

prior probability posterior probability

0.000 0.000

0.072 0.000

0.855 0.899

0.072 0.100

0.000 0.000

0.000 0.000

0.000 0.000

7543

dx.doi.org/10.1021/ie404232v | Ind. Eng. Chem. Res. 2014, 53, 7538−7547

Industrial & Engineering Chemistry Research

Article

Table 5. Actual (Based on 100 Samples) and Estimated PMFs of X and Y State S1

State S2

State S3

State S4

State S5

State S6

State S7

X

variable actual probability sampled probability MLE probability MAPE probability

0.0100 0.0100 0.0100 0.0104

0.1500 0.1700 0.1693 0.1676

0.7000 0.7300 0.7269 0.7320

0.0800 0.0500 0.0498 0.0460

0.0050 0.0000 0.0031 0.0060

0.0500 0.0400 0.0398 0.0376

0.0050 0.0000 0.0012 0.0004

Y

actual probability sampled probability MLE probability MAPE probability

0.0010 0.0000 0.0012 0.0072

0.0500 0.1100 0.1096 0.0654

0.1100 0.0900 0.0897 0.1974

0.5500 0.5000 0.4983 0.3502

0.1900 0.1800 0.1794 0.2972

0.0900 0.1200 0.1196 0.0753

0.0090 0.0000 0.0023 0.0072

Figure 3. Actual and estimated conditional probabilities of W, given (a) Y and Z in their lowest states, and (b) Y and Z in their highest states.

Figure 4. Bayesian network of the example system based on estimated PMFs: (a) normal operation, (b) predictive inference (evidence: X in its lowest state), and (c) diagnostic inference (evidence: W in its highest state).

To evaluate the quality of the inference made using PMFs estimated based on the 100 sample data, the posterior probabilities of X and Y, calculated based on the estimated PMFs, are compared with the posterior probabilities of X and Y calculated based on the actual PMFs. The evidence is that W is in its highest state. This comparison is shown in Figures 5a and 5b, indicating the good accuracy of the inference made based on the ME-estimated PMFs. These figures also show that, given node W in an extreme state that does not appear in the data,

other nodes including the root nodes are updated, the most probable cause(s) of the observed evidence can be identified. In both forward and backward cases, identifying abnormal states (which are often rare states) is of critical importance; that is, states of most interest, whether in risk assessment or fault detection, are often those states that are poorly reflected in the data, because of their small probabilities. Figure 4c shows the updated network, given the evidence that W is in its highest state. 7544

dx.doi.org/10.1021/ie404232v | Ind. Eng. Chem. Res. 2014, 53, 7538−7547

Industrial & Engineering Chemistry Research

Article

in Figures 4b and 4c for both forward and backward inference cases. As shown in Figure 6a, the RKLD values indicate that moving X to its lowest state causes the largest change in node Z and then in U. This prediction makes sense, since X affects U through Z. RKLD values can also be used to identify the most probable cause of an abnormality observed in node W by backward Bayesian inference (Figure 4c). Figure 6b shows that the deviation of the posterior probabilities of Y from its normal operation probabilities is more than that of node X; therefore, the former node is the variable that most likely led to the observed anomaly in node W. The generalization of these analyses to larger and denser systems is straightforward; the same principles for estimating the probabilities apply, regardless of the size of the system or its degree of complexity.

Table 6. Means and Standard Deviations of the Estimated Probabilities of the States of X probability

mean

P(0.5) P(1.5) P(2.5) P(3.5) P(4.5) P(5.5) P(6.5)

0.0104 0.1676 0.7320 0.0460 0.0060 0.0376 0.0004

standard deviation 2.8053 5.7000 6.2000 1.3000 1.9348 2.3000 2.4674

× × × × × × ×

10−7 10−3 10−3 10−3 10−6 10−3 10−8

the Bayesian network trained with the complete probabilities can be used to calculate posterior probabilities of all states of the variables and identify the cause(s) of the evidence. Mainly because of the use of the maximum entropy principle, the information entropy of each posterior probability is higher than that of its actual counterpart. 3.4. Risk Quantification. To quantify the changes in marginal probabilities caused by evidence and identify the node(s) that undergo the most significant change, we use a probability distance measure called Kullback−Liebler divergence (KLD) or information gain.37 This measure reveals information about the relative entropy of two probability distributions defined over the same set of states and, therefore, indicates how different the two marginal probabilities are: N

KLDj =

⎛ Q (Xj = xj , i) ⎞ ⎟⎟ ⎝ P(Xj = xj , i) ⎠

∑ Q (Xj = xj ,i) ln⎜⎜ i=1

4. CONCLUSIONS This paper presented a method of estimating complete multivariate discrete mass probability functions (PMFs) from scarce data, allowing one to estimate the probabilities of states with no observed data (rare states). The method maximizes Shannon’s information entropy subject to constraints imposed by empirical moments of the sample data. The Maximum Likelihood Estimate (MLE) and the Maximum a Posteriori Estimate (MAPE) approaches to optimally select moment function orders were proposed. Advantages of this maximum entropy (ME) method over other probability estimation techniques are as follows. First, since the information content of individual sample points are compacted in the form of moments, larger sample sizes do not affect the speed of convergence. Second, as no limiting assumptions are made on the dependence structure of the domain variables, the method is capable of modeling highly nonlinear relations. Third, the MLE and MAPE criteria to determine the optimal level of the model’s complexity enables one to use the highest possible flexibility in modeling while avoiding overfitting. Because of these features, the method provides reliable probability estimates for states outside the observed region, where most of the risky events are likely to occur. The use of the ME PMF estimation method, together with Bayesian networks, allows the user to calculate the probability of abnormal event propagation, where an abnormal situation in one component of a plant increases the chance of an abnormal

(26)

where P(Xj = xj,i) and Q(Xj = xj,i) are the prior and posterior probabilities of the ith state of node Xj with N states, respectively. When the posterior probability of a state is 0, its corresponding term in the summation of eq 26 is 0, since 0 × log(0) = 0. To measure the most significant deviations in the network, we then use the relative Kullback−Liebler divergence (RKLD) . For a node Xj, the RKLD is defined as RKLDj =

KLDj w ∑ j = 1 KLDj

(27)

where w is the number of the nodes of interest. Figure 6 compares the RKLD values corresponding to the results shown

Figure 5. Diagnostic Bayesian inference based on the actual and estimated PMFs, given W in its highest state. 7545

dx.doi.org/10.1021/ie404232v | Ind. Eng. Chem. Res. 2014, 53, 7538−7547

Industrial & Engineering Chemistry Research

Article

Figure 6. RKLD values for the nodes: (a) predictive inference and (b) diagnostic inference. In Proceedings of the 2005 Winter Simulation Conference, Orlando, FL, 2005; pp 682−691. (7) Scott, D. W. Multivariate Density Estimation: Theory, Practice, and Visualization; John Wiley: New York, 1992. (8) Tsybakov, A. B. Introduction to Nonparametric Estimation; Springer: New York, 2009. (9) Devroye, L.; Gábor, L. Combinatorial Methods in Density Estimation; Springer: New York, 2001. (10) Piotr, J.; Durante, F.; Härdle, W. K.; Rychlik, T. Copula Theory and Its Applications: Proceedings of the Workshop Held in Warsaw 2009; Springer: Heidelberg, 2010. (11) Zellner, A.; Highfield, A. R. Calculation of maximum entropy distribution and approximation of marginal posterior distributions. J. Econometrics 1988, 37, 195−209. (12) Duong, T.; Hazelton, M. L. Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation. J. Multivar. Anal. 2005, 93, 417−433. (13) Embrechts, P.; Lindskog, F.; McNeil, A. Modelling dependence with copulas and applications to risk management. In Handbook of Heavy Tailed Distributions in Finance; Rachev, S., Ed.; Elsevier: New York, 2003; pp 331−385. (14) Mehranbod, N.; Soroush, M.; Panjapornpon, C. A method of sensor fault detection and identification. J. Process Control 2005, 15, 321−339. (15) Villez, K.; Srinivasan, B.; Rengaswamy, R.; Narasimhan, S.; Venkatasubramanian, V. Kalman-based strategies for Fault Detection and Identification (FDI): Extensions and critical evaluation for a buffer tank system. Comput. Chem. Eng. 2011, 35 (5), 806−816. (16) Bin Shams, M. A.; Budman, H. M.; Duever, T. A. Fault detection, identification and diagnosis using CUSUM based PCA. Chem. Eng. Sci. 2011, 66 (20), 4488−4498. (17) Pariyani, A.; Seider, W. D.; Oktem, U. G.; Soroush. M. Improving process safety and product quality using large databases. In Computer-Aided Chemical Engineering; Pierucci, S., Buzzi Ferraris, G., Eds.; Elsevier: Naples, Italy, 2010; Vol. 28, pp 175−180. (18) Castillo, I.; Edgar, T. F.; Dunia, R. Nonlinear model-based fault detection with fuzzy set fault isolation. In IECON 201036th Annual Conference on IEEE Industrial Electronics Society, Phoenix, AZ, 2010; pp 174−179. (19) Ö zyurt, B.; Kandel, A. A hybrid hierarchical neural networkfuzzy expert system approach to chemical process fault diagnosis. Fuzzy Sets Syst. 1996, 83 (1), 11−25. (20) Isermann, R. Fault-Diagnosis Systems: An Introduction from Fault Detection to Fault Tolerance; Springer: Berlin, 2006. (21) Mohseni Ahooyi, T.; Arbogast, J. E.; Oktem, U.; Seider, W. D.; Soroush, M. Maximum-Likelihood Maximum-Entropy Estimation of

situation in another component. Although Bayesian networks permit decomposing high dimensional multivariate distributions, making the method more tractable, care must be taken when working with overly dense networks and/or having high numbers of states that may significantly increase the number of variables and the resulting computational cost.



AUTHOR INFORMATION

Corresponding Author

*Tel.: (215) 895-1710. Fax: (215) 895-5837. E-mail: [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS This material is based upon work supported by the National Science Foundation under Grant Nos. CBET-1066461 and CBET-1066475. T.M. was supported by the National Science Foundation (through Grant No. 1066461). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors would like to thank Amogh V. Prabhu, Darrin Feather, Benjamin J. Jurcik, and Brian M. Besancon for their invaluable comments on this work.



REFERENCES

(1) Harms-Ringdahl, L. Safety Analysis, Principles and Practice in Occupational Safety; CRC Press: New York, 2001. (2) Garvey, P. R. Analytical Methods for Risk Management: A Systems Engineering Perspective; Chapman and Hall/CRC: Boca Raton, FL, 2009. (3) Taleb, N. N. The Black Swan: The Impact of the Highly Improbable; Random House: New York, 2007. (4) Kaynar, B.; Ridder, A. The cross-entropy method with patching for rare-event simulation of large Markov chains. Eur. J. Oper. Res. 2010, 207, 1380−1397. (5) Shahabuddin, P. Rare event simulation of stochastic systems. In Proceedings of the 1995 Winter Simulation Conference. IEEE Press: Washington, DC, 1995; pp 178−185. (6) Cerou, F.; LeGland, F.; Del Moral, P.; Lezaud, P. Limit theorems for the multilevel splitting algorithm in the simulation of rare events. 7546

dx.doi.org/10.1021/ie404232v | Ind. Eng. Chem. Res. 2014, 53, 7538−7547

Industrial & Engineering Chemistry Research

Article

Multivariate Probability Density Functions. AIChE J. 2014, 60 (3), 1013−1026. (22) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, 2003. (23) Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 1984, 27, 379−423. (24) Williams, D. Weighing the Odds: A Course in Probability and Statistics; Cambridge University Press: New York, 2001. (25) Shannon, C. E. Prediction and entropy of printed English. Bell Syst. Tech. J. 1951, 30, 50−64. (26) Gray, R. M. Entropy and Information Theory, 2nd ed.; Springer: New York, 2011. (27) Eliason, S. R. Maximum Likelihood Estimation: Logic and Practice; SAGE Publications, Inc.: Newbury Park, CA, 1993. (28) Ruszczyński, A. Nonlinear Optimization; Princeton University Press: Princeton, NJ, 2006. (29) Subramanian, J.; Simon, R. Overfitting in prediction modelsIs it a problem only in high dimensions? Contemp. Clin. Trials 2013, 36 (2), 636−641. (30) Jefferys, W. H.; Berger, J. O. Ockham’s Razor and Bayesian Statistics. Am. Sci. 1991, 80, 64−72. (31) Briscoe, E.; Feldman, J. Conceptual complexity and the bias/ variance tradeoff. Cognition 2011, 118 (1), 2−16. (32) Ando, T. Bayesian Model Selection and Statistical Modeling; Chapman and Hall/CRC: Boca Raton, FL, 2010. (33) Ferentinos, K. On Tchebycheff’s type inequalities. Trab. ́ Estadistica Invest. Oper. 1982, 33 (1), 125−132. (34) Netica v. 4.16 [software]; Norsys Software Corporation; Vancouver, BC, Canada, 2010. Available at http://www.norsys.com. (35) Quenouille, M. H. Notes on Bias in Estimation. Biometrika 1956, 43 (3−4), 353−360. (36) Lauritzen, L.; Spiegelhalter, J. Local computations with probabilities on graphical structures and their application to expert systems. J. R. Stat. Soc. 1988, 50, 157−224. (37) Kullback, S. Letter to the Editor: The Kullback−Leibler distance. Am. Stat. 1987, 41, 340−341.

7547

dx.doi.org/10.1021/ie404232v | Ind. Eng. Chem. Res. 2014, 53, 7538−7547