Chapter 4
Geostatistics for Sampling Designs and Analysis Allan Gutjahr
Downloaded by UNIV OF ARIZONA on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch004
New Mexico Institute of Mining and Technology, Socorro, NM 87801
Spatial variability and its affect on groundwater flow and transport is an active research field. The characterization of that spatial (and possible temporal) variability can often be done effectively by using geostatistical techniques. The methods used and the implications for designs and analysis of groundwater transport and pollution problems will be discussed and illustrated. Discussion will include the incorporation of soft-data and their utility.
Classical statistical procedures are mainly concerned with estimating mean values: variation is viewed as a nuisance that needs to be controlled. By way of contrast geostatistics deals with data and problems that include uncontrollable variation that also has structure. The data most often is taken in space (either two or three-dimensions) and is presumed to have some embedded connectedness for continuity. The objectives of any analysis can vary and include explanation of the variability, construction of predictive models, interpolation and extrapolation of values, design of sampling plans and interrelationships between different properties like conductivity and concentration. The geostatistical approach views variation as part of an overall problem that can convey information about the phenomena being studied. Another aspect to note about geostatistics is that often only one realization is available and any inference requires additional assumptions. For example, in making predictions about contamination only data from a single region may be available. In addition, predictions for that region, taking into account observations, are desired. While there may be multiple observations, they are generally not independent and consequently many of the classical statistical procedures are not applicable. The statistical inter-relationship is summarized by a covariance function that is a measure of how observations at different locations are 0097-6156/91A)465-0048$11.75A) © 1991 American Chemical Society
In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.
Downloaded by UNIV OF ARIZONA on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch004
4. GUTJAHR
49
Geostatistics for Sampling Designs and Analysis
statistically connected. This covariance behavior needs to be inferred from the data. The result can then be used to make predictions for values at unsampled locations along with measures of uncertainty for those predictions. In this paper the sections start with the fundamental terms and definitions and then move on to some ramifications and current developments. The key geostatistical definitions are summarized in a glossary at the end of the paper. The approaches will be compared to and contrasted with classical statistical ideas and important issues that arise in the application of geostatistics will be illustrated. In addition to the estimation problem concerns about prediction and the associated uncertainties will be addressed. The use of qualitative and "soft" data will be discussed and some simple examples will be used to show the utility of this approach. Several examples and a case study will be presented. The chapter closes with some open questions and a glossary. An elementary and clear introduction for readers with a minimal background in statistics is given elsewhere (1). More advanced texts that require some greater knowledge of probability and stochastic processes are available in brief (2) and expanded (3) versions. For non-linear geostatistics and other extensions, the brief monograph (2) is highly recommended. Geostatistical Concepts Random Fields. The stochastic concepts underlying geostatistics come from the area of random fields, although Journal (4,5) also presents some of the ideas with a minimum of random field theory. Locations in space (1,2,3 or 4 dimensions where time may be included) are designated by χ and a generic variable or function of χ is designated by V(x). Definition: V(x) is a random field if, for any fixed location Xo, V(XQ) is a random variable. V(x) is also called a spatial stochastic process. The term random field is preferred here because it emphasizes the spatial nature of the region over which the process is defined. One point to note about this general definition is that a description of the statistical or probabilistic behavior of V(x) involves not just the univariate distribution P[V(x) * v ], (the probability that V(x) is less than or equal to v ), but also the joint distributions P(V(x ) * v V(x ) * v ,...,V(x ) * v ) for any set of locations x ..x . An important aspect of the model is that V(x ) has some statistical relationship with V(x ). This contrasts, for example, with multiple regression trend surface models where Y(Xj) = m(Xj) + e ; m(x) is some non-random function and the €j's are statistically independent. The probability model postulates a structure where one could, in theory, have different outcomes "paths" or "realizations" at any location x. The operator Ε is used for expectation and E[V(x)], is the expected value of V at location x, is taken over all possible "paths" or "realizations" at x. Namely E[V(x)] is an ensemble average. 0
0
a
v
lf
n
2
2
n
n
x
2
}
In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.
Downloaded by UNIV OF ARIZONA on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch004
50
GROUNDWATER RESIDUE SAMPLING DESIGN
Stationarity. Because a complete joint probability distribution is virtually impossible to find, the first two moments are often the main focus: The mean: E[V(x)] = m(x) Thç çovfrriançg: Cov[V(x), V(y)] = E{[V(x)-m(x)][V(y)-m(y)]}. Here χ and y denote two different locations in space. Even these, at this level of generality, may be hard to estimate and invariably further kinds of stationarity decisions or assumptions are made. The stationarity decision refers to the model (4-6) and is not one that can be tested in any statistical manner. Stationarity assumptions relate to inference questions (4,5). In standard statistical studies one has a population picked a-priori about which inferences are made: This population decision is analogous to the decision about stationarity. Stationarity may be reasonable to assume over one region studied (e.g. km ) but not over a larger region where subdivision or stratification may be required. Observations can throw doubt on the prior decision but can't be used in any direct statistical test. In a conventional sampling study the variable of interest could be a variable like telephone usage. An initial study may not involve stratification but further examination of results could lead to a finer breakdown and use of special sampling techniques. Analogously, in geostatistics an initial study may involve a composite sample that after careful scrutiny could be divided into sub-divisions where the stationarity hypothesis is tenable. In another context stationarity can be taken as a working hypothesis. 2
Statistical Homogeneity. A common statistical assumption is that of statistical homogeneity or second-order stationarity. Definition : V(x) is statistically homogeneous if (i) E(V(x)) = m is constant (ii) Cov [V(x),V(y)] = Q,(x-y) only depends on the separation vector x-^. Once again χ and y denote two different locations in space. The function C (x-y) is called the (auto-) covariance function for the field V(x). If QQç-.y) depends only on the distance | |x-y | | and not the direction then V(x) is called statistically isotropic. Figure 1(a) shows some typical forms of theoretical covariance functions and Table I summarizes associated formulas. v
For example, the porosity in a region associated with a given layer of material can be taken as the random field V(x). The covariance function at 0 is the variance of the field. The correlation function p (x-y) = C (xy)/C (Q) is a measure of how well linear prediction can be made about V(y) based on an observation V(x). A correlation function that dies out slowly essentially corresponds to a "smoother" process that has a higher degree of predictability even if the variance is large. v
v
In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.
v
4. GUTJAHR
51
Geostatistics for Sampling Designs and Analysis
Table I Some Covariance Functions s = χ - £ = separation vector 2
Downloaded by UNIV OF ARIZONA on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch004
Exponential: C(s) = σ exp{-|i|/A} 2
2
2
Bell: C(s) = σ βχρ{-1&| /λ } 2
Triangular: C(s) = σ [1 - [s|/b] C(s) = 0
, |s| s b , |s| >b
2
2
1/2
Exponential: C(s) = σ e x p [ - ( £ Jj /*?) 1 j- ι 2
3
Spherical: C(s) = o [M.5(UI/B) + 0.5(|s|/B) ], |$| s Β C(s) = 0 , ls| > B.
Analogy to Nested Designs. The role of covariance or correlation is central in geostatistics and is one feature that sets it apart from classical statistics. Yet it is well to remember that in classical statistics correlation, too, plays an important role. For example in completely randomized (or nested) experiments (7), a model of the form = μ + A + B.. {
+
e ; ijk
/=l..i, ;=1..J, k=\..K
(1)
is postulated. This model applies, for example, to measurements taken when I fields are selected at random, J plots within each field are selected and then Κ measurements made at random within each plot. Typically μ is assumed to be an unknown constant, the A, Β and e random variables are independent and there is an intra-class correlation between Y^ and Y^,. The nested model has several analogies with the stochastic models. The decision to write the data as in model (1) is similar to the stationarity k
In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.
52
GROUNDWATER RESIDUE SAMPLING DESIGN
Downloaded by UNIV OF ARIZONA on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch004
decision - it is an underlying decision about the experiment. It can be modified based on the observations but it is not generally a testable question. A n important aim in a nested experimental design is to characterize the variability associated with the various sources (e.g. fields, plots and measurements). In the random field model the estimation of the covariance function also has as its aim the characterization of variability. Characterization of variability. What are some important features that characterize this variability? Two of these have been mentioned before: The variance of the field V(x) and the covariance function. The covariance function, when rescaled by the variance, indicates predictability. The behavior of the covariance function is often summarized by a "scale" indicating a significant correlation distance. There is no rigorous definition of such a scale: here it is simply taken to be that value λ where Ο,(λ)/Ο,(0) = e' , assuming V(x) is statistically isotropic. 1
_1
Definition: The scale, λ, is that value where C (X)/C (0)=e . v
v
Figures l(b,c) illustrate some of these ideas. Figure 1(b) shows several paths from a process with small variance ( σ = var[V(x)] = 1) and a short scale (λ = 1.0). Figure 1(c) shows paths with larger variance ( σ = 2) but a long scale (λ = 10). In Figure 1(c) one can do a fairly good job of predicting V(x + 0.5) from V(x) - essentially the paths are smoother than in Figure 1(b) but there is a larger variability between paths. This again is analogous to the nested case. The variability within shorter path segments is small while the variability between paths is large. In general, the functional form of the covariance function is not as important as the value of the variance and the scale. 2
ν
2
ν
Intrinsic Random Fields and Variograms. The concept of statistical homogeneity or second-order stationarity may be too restrictive and this has led to related stationarity constructs, most notably that of an intrinsic random field of order 0 (8,9). Definition: V(x) is an intrinsic random field of order 0 (IRF-0) if (i) E[V(x)] = m (ii)
2
Y (x-i) = 1/2 E[V(x) - V(y)] only depends on x-y v
(Hi) var [J2 ay(x)] is finite if £ i - 1
a, = 0
i - 1
where var [ ] denotes the variance of the expression within [ ]. γ ($) is called the (semi-) variogram, and 3 = x-y is the vector between the two locations χ and y. The modifier semi- is dropped and y(s) is simply referred to as the variogram. ν
v
In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.
GUTJAHR
Geostatistics for Sampling Designs and Analysis
1.10
0.90
0.70
Downloaded by UNIV OF ARIZONA on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch004
ο
0.50
0.30
0.10
-0.10 ι ι 0.00
ι ι ι
1.00
2.00
3.00 Separation: s
I ι ι ι ι ι ι ι ι ι I
4.00
5.00
6.00
Figure Ma) Some typical covariance functions. solid line: exp {Is]} long dash: ri-(fsf/2] intermediate dash: exp {-sy2} short dash: exp {-s} 2
3.00
10.00 2
Figure Kb) Paths with σ = 1, λ = 1, Exponential covariance
In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.
Downloaded by UNIV OF ARIZONA on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch004
GROUNDWATER RESIDUE SAMPLING DESIGN
3.00 -2
2
Figure 1(c) Paths with σ = 2, λ = 10, Exponential covariance
In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.
4. GUTJAHR
55
Geostatistics for Sampling Designs and Analysis
A simple example of a random field in 1-dimension that is IRF - 0 is Brownian motion. The discrete version is of the form m
V(m) = £
W
k
k - 1
2
where E[W ]=0, var (W ) = σ Δ and the W 's are independent. The covariance function is cov (V(k) V(l)) = σ Δ min (k,l) and so V(k) is not statistically homogeneous. On the other hand v (j) = o Aj/2 and hence V(k) is IRF = 0. When V(x) is statistically homogeneous then k
k
k
2
2
Downloaded by UNIV OF ARIZONA on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch004
v
Yvfc) = CvGO - C (s).
(2)
v
The variogram is a measure of dissimilarity (2). For the statistically homogeneous case, from equation (2) it is seen that γ ( £ ) is the inverted covariance function. The limiting value of C (Q) is often referred to as the sill. ν
V
Data analysis and Geostatistics Ergodicity. In most geostatistical studies the covariance function or variogram needs to be estimated along with other features of the observed data. This estimation is almost always based on observations from a single realization and invariably an assumption of ergodicity is made. The assumption, loosely stated, says that if data from a single realization is taken over a large enough field then the sample mean and covariance/variogram will be close to the ensemble mean and covariance/variogram. In other words, space averages will converge to ensemble averages. This is a strong assumption and again is not amenable to statistical testing. For the field that has realization shown in Figure 1(c) one would need a sample over a large range to achieve the desired convergence. The ergodicity decision may be de-emphasized if the inference is for spatial averages on specific areas (5). Data Analysis. Assume η observations, V(x )...V(x ) are available and the objective is to estimate quantities like the mean, covariance, and variogram. Prior to undertaking such estimation an exploratory data analysis should be carried out for the field. A good elementary reference for exploratory data analysis is the paper by Tukey (10). Note that the data here is not like an independent random sample. Concepts like the empirical distribution function still can be used but distributional tests (e.g. chi-square or Kolmogorov-Smirnov goodness of fit tests) are not directly applicable. In effect the correlated nature of the data yields less information for certain purposes. x
n
In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.
Downloaded by UNIV OF ARIZONA on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch004
56
GROUNDWATER RESIDUE SAMPLING DESIGN
A declustering technique (11) can be used to account for the correlation where the estimator of the distribution function is similar to a probability weighted estimator of the type used in sampling theory. The idea is to overlay a regular grid (say of L squares) where the squares are large enough so all contain at least one point. The distributionfonctionis then estimated by using averages of indicators within the squares, averaged over L. In carrying out an exploratory data analysis, percentiles and quantiles (e.g. the median and deciles) can be especially useful. If there is sufficient data available the regions should be split into several subregions to see how these quantities change, whether transformations (eg. log, square root) are needed and whether other corrections would be useful. Effective values. One concept that carries over simply from time series is that of an effective sample size. Let V ^ ) , i = 1, 2...n be a sample of size η from a statistically homogeneous random field with covariance function C (x-y). Consider estimating the expected value, E[V(x)] by the sample v
η mean:
K= £
V(x)ln.
Definition: Let and
o\ = Cy(0)/n o\ = var(V) = £
£
i» 1 jm 1
The effective sample size. n , is defined as eff
"eff = Φΐ/οΙ)
3
()
The ratio is a reduction factor due to the correlation structure of V(x). If the V(Xj)'s were independent the variance of V would be ο . However, the actual variance in the non-independent case is σ . The effective sample size gives the equivalent number of independent observations yielding the same variance for K. Note that this reduction factor will depend on the data location and the covariance function. In Table II the reduction factor is shown for several designs in two dimensions, where the covariance function used was exp{-|s|/X}. λ
2
In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.
4. GUTJAHR
Geostatistics for Sampling Designs and Analysis Table II Reduction Factors for Sample Designs
Downloaded by UNIV OF ARIZONA on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch004
Sample Network
10 χ 10, Spacing 10 χ 10, Spacing 10 χ 10, Spacing 5 x 5 , Spacing 5 x 5 , Spacing
= = = = =
0.5 (n=100) 1 (n=100) 2 (n=100) 1 (n=25) 2 (n=25)
Effective
Reduction
Sample Size
Factor o\/a\.
6.87 19.62 54.50 6.38 14.81
0.069 0.196 0.545 0.255 0.592
For example, if 100 samples are taken on a regular grid at spacing of .5 correlation scales, then the effective number of independent samples is only 6.87. Furthermore, for a spacing of 1 correlation length on a 5x5 grid, the effective sample size is 6.38 and for the same spacing on a 10x10 grid the effective sample size is 19.62. The covariance structure consequently implies that, at least for averages, one can have considerably less information than in the independent case. The factor o\/o\ shows the reduction in the sample size needed to get
n . c((
Barnes (12) discusses a related concept and shows that the effective sample size has behavior of the type given in Table II by using a simulation approach. Other authors (13) have also examined estimation of means using geostatistical techniques and found similar results. These considerations would imply that to get a greater effective sample size one would increase the intersampling distances. However we will see below that for accurate variogram estimation the reverse is true: In that case the intersample distances should be small. Covariance and variogram estimation procedure. The data ¥(&) i = l...n will ordinarily not be on equi-spaced or regular grids, but rather scattered across the field. For example, V(xj) may represent concentration of a substance at location x where the x/s are not on a grid. The procedure for variogram/covariance estimation involves grouping the data into classes. Several standard programs (3,14) are available for the variogram calculations and for the non-ergodic covariance (1,15). The procedure described below allows for non-isotropic covariances: That is the covariance may depend on both distance and angle between points. In addition, it is restricted to a field in two dimensions. [y
In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.
57
58
GROUNDWATER RESIDUE SAMPLING DESIGN
The input to the estimation procedure involves a lag-increment parameter δ, a set of directions i = 1...I and an angular tolerance parameter a, Step 1 Form the set of data pairs A(rô,ei) = the set of Xj and x,, with distance between (r-l)ô and rô, and angle between^ and in the interval (β - α, β + a). ϊ
ϊ
= {(xj^ ):(r-l)ô