Groundwater Residue Sampling Design - American Chemical Society

Chapter 5

Minimum Cost Sample Allocation

Robert E. Mason¹ and James Boland²

¹Research Triangle Institute, Research Triangle Park, NC 27709
²U.S. Environmental Protection Agency, 401 M Street SW, Washington, DC 20460

Downloaded by CORNELL UNIV on August 6, 2012 | http://pubs.acs.org Publication Date: June 20, 1991 | doi: 10.1021/bk-1991-0465.ch005


A procedure for determining the minimum cost allocation of samples subject to multiple variance constraints is described. The procedure is illustrated using information developed for the National Pesticide Survey conducted by the United States Environmental Protection Agency.

Seldom are field studies conducted with but a single objective. More usually, the investigator is faced with the problem of designing a field study to satisfy multiple objectives, often with limited resources available. This paper addresses the problem of allocating field study resources to simultaneously satisfy an arbitrary number of objectives for the least cost.

Inferential Population

The first step in designing a field study is to develop a fully operational definition of the population (or universe) of inferential interest. Five points are addressed in the population definition.

• the spatial dimension of the population
• the temporal dimension of the population
• the units of observation that comprise the population
• eligibility criteria to differentiate between population units and otherwise similar units (of no interest to the study)
• the identification of domains (groups or subpopulations of units) that are of special interest to the investigation

The second step is to identify and define the population parameters that are to form the basis of the design, that is, the characteristics of the population that are the central subject of the investigation. These may be population totals, averages, proportions, regression relations, comparisons, and so on, and are defined as functions of observation or response variable values over the entire population.

The final design step is to specify the magnitudes of the variances that are to be associated with the identified parameter estimates. The specifications often take the form of quantities related to the variances rather than the variances themselves, such as relative standard errors, confidence intervals, or the power to be associated with a statistical test.

0097-6156/91/0465-0091$06.00/0 © 1991 American Chemical Society

In Groundwater Residue Sampling Design; Nash, R., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1991.


Population Concepts

The units comprising the population of inferential interest are denoted by $U_g$, where $g = 1, 2, \ldots, N$. Note the implications that,

• the population, although perhaps very large, is finite, there being $N$ units in total, and,
• the population units are distinct, such that an individual is recognizable as the g-th unit.

Otherwise the units themselves may be anything, for example, rural domestic wells or ground water volumes defined within a three dimensional space. Arbitrary units such as the latter are constructed with the measurement technology in mind. That is, the units are constructed of such a size and shape that they can be accurately characterized by the measurement procedures planned for use. The objective is to construct units such that the measurement variability is small in relation to the variability among the population units.

The spatial dimension of the population definition defines the study site, for example, all rural domestic wells in the United States, or the total ground water volume to a specified depth underlying a specified field. Robust statistical inferences are, of course, limited to the selected study site. That is, statistical arguments supporting the validity of the conclusions reached are themselves valid only for the study site population.

If the population parameters of interest to the investigation are temporally varying quantities, then the population units, $U_g$, are defined in both time and space. The total data collection period defines the temporal reference for the study, and inferences are restricted to the corresponding time frame. The g-subscript in this case takes on the values,

$$g = 1, 2, \ldots, N_1,\; N_1 + 1, N_1 + 2, \ldots, N_1 + N_2,\; \ldots,\; N ,$$

where the subscripted N-values denote the number of spatial units available for study at different times. The times are denoted by $t = 1, 2, \ldots, L$, and the total population size is defined by,

$$N = \sum_{t=1}^{L} N_t .$$


The time intervals identified by the t-subscript are arbitrary. Like arbitrarily defined spatial units, temporal units are constructed such that measurement errors are kept small in relation to the variability that exists among the temporal units. That is, a temporal unit is of short enough duration that the variability of possible response variable values within a unit is small in relation to the variability that exists from one unit to another.

An observation or response variable value associated with the g-th unit in the population is denoted by $y_g$. Note the implication that every unit in the population is observable. The point has some importance in identifying the population parameters to form the basis of the design and in the subsequent data analysis.

A univariate population mean provides a familiar example of a population parameter. The quantity,

$$A_y = \frac{1}{N} \sum_{g=1}^{N} y_g ,$$

defines the mean. The population variance is defined by,

$$V_y = \frac{1}{N} \sum_{g=1}^{N} \left[ y_g - A_y \right]^2 .$$
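The two definitions can be illustrated with a minimal Python sketch. The function names and the five concentration values are hypothetical, and the divisor N is assumed for the variance (a finite-population definition), in line with the mean.

```python
from typing import Sequence


def population_mean(y: Sequence[float]) -> float:
    """A_y: the sum of y_g over all N population units, divided by N."""
    return sum(y) / len(y)


def population_variance(y: Sequence[float]) -> float:
    """V_y: the mean squared deviation of y_g about A_y (divisor N assumed)."""
    a_y = population_mean(y)
    return sum((y_g - a_y) ** 2 for y_g in y) / len(y)


# Hypothetical response values for a population of N = 5 units.
y = [2.0, 4.0, 4.0, 4.0, 6.0]
a_y = population_mean(y)
v_y = population_variance(y)
```

Both functions require a value of y for every unit, which is the observability requirement noted above.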

Two problems can arise. First, some information about the magnitudes of $A_y$ and $V_y$ is needed for design purposes. Sometimes the information is available from previous studies, but more commonly the information is not available, the purpose of the study being to provide it. Second, note that if y-values are not able to be obtained for some values of the g-subscript, then neither the parameter nor its variance is defined. If, for example, $y_g$ is the observed concentration of a specified chemical in the g-th unit, then y-values are not observable for as many units as have concentrations below the method detection limit.

A convenient way around both problems is to design the study in the context of specifying the probabilities with which specified contamination frequencies will be detected. The exercise is equivalent to specifying the maximum values of the variances to be associated with sample estimates of specified domain sizes. The domain sizes to provide the basis for the design are determined based either on what is known about the actual state of nature, or on policy and program considerations. The sampling designs for both the EPA's National Pesticide Survey (Mason, R. E. and R. M. Lucas, Research Triangle Institute, report number RTI/7801/04-04F, 1988, unpublished) and Monsanto's agrichemical survey (Graham, J. A., presented at Groundwater Quality Methodology Workshop, Arlington, Virginia, November 1988) were developed along this line.

Specifying the design problem this way has some generality and provides a useful surrogate for other parameters. Certainly parameters describing other domain characteristics are unlikely to be reliably estimated if the domain sizes themselves cannot be.

In this context, the observed chemical concentrations place the g-th unit in a specified concentration category or domain. Notationally, the indicator variable,

$$X_{dg} = \begin{cases} 1, & \text{if the g-th unit belongs to the d-th domain,} \\ 0, & \text{otherwise.} \end{cases}$$

The indicator variable is observable for every unit in the population, assuming that 'below the detection limit' is one of the domains. The parameters of design interest become the relative domain sizes (population proportions) defined by,

$$P_d = \frac{1}{N} \sum_{g=1}^{N} X_{dg} ,$$

with the associated population variances,

$$V_d = \frac{1}{N} \sum_{g=1}^{N} \left[ X_{dg} - P_d \right]^2 = P_d \left( 1 - P_d \right) .$$
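As a hedged illustration of the domain approach, the sketch below classifies every unit into a concentration domain, with "below the detection limit" as its own domain, and then computes the relative domain sizes and their variances. The function names, the cutpoint, and the concentrations are hypothetical; the variance is taken to be P(1 − P), the population variance of a 0/1 indicator.

```python
from typing import List, Optional, Sequence, Tuple


def domain_of(conc: Optional[float], cutpoints: Sequence[float]) -> int:
    """Return the domain index for one unit.

    Domain 0 is 'below the detection limit' (conc is None); domains
    1, 2, ... are concentration categories separated by `cutpoints`.
    """
    if conc is None:
        return 0
    return 1 + sum(conc >= c for c in cutpoints)


def domain_proportions(
    concs: Sequence[Optional[float]], cutpoints: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Relative domain sizes and the variances of the domain indicators."""
    counts = [0] * (len(cutpoints) + 2)
    for conc in concs:
        counts[domain_of(conc, cutpoints)] += 1  # adds the 0/1 indicator
    big_n = len(concs)
    p = [count / big_n for count in counts]
    v = [p_d * (1.0 - p_d) for p_d in p]
    return p, v


# Hypothetical population of N = 10 wells; None marks a non-detect.
concs = [None, None, None, None, None, None, 0.2, 0.4, 1.5, 3.0]
p, v = domain_proportions(concs, cutpoints=[1.0])
```

Every unit receives a domain, including the non-detects, so the proportions are defined even though the concentration mean is not.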

Sampling Concepts

In designing a sample, the investigator assigns (relative) selection frequencies to each of the population units such that,

• linear statistics provide design unbiased estimates of corresponding parameters, and,
• the sampling variances of the parameter estimates do not exceed prespecified values.

Selection frequencies are denoted by,

$$f_g = \frac{n \, S_g}{S_+} ,$$

where,

$n$ = the sample size,
$S_g$ = the size measure associated with the g-th unit, and,

$$S_+ = \sum_{g=1}^{N} S_g .$$
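A small sketch of how selection frequencies might be computed from size measures follows. It assumes the frequency for unit g is the sample size times the unit's share of the total size measure, n·S_g/S_+, which is the form suggested by the definitions above; the function name and the size measures are hypothetical.

```python
from typing import List, Sequence


def selection_frequencies(sizes: Sequence[float], n: int) -> List[float]:
    """Assumed form n * S_g / S_+ for each unit g."""
    s_plus = sum(sizes)  # S_+, the total of the size measures
    return [n * s_g / s_plus for s_g in sizes]


# Hypothetical size measures for four units, sample size n = 2.
pps = selection_frequencies([20.0, 30.0, 50.0, 100.0], n=2)

# Size measures all equal to one give equal probability sampling.
equal = selection_frequencies([1.0, 1.0, 1.0, 1.0], n=2)
```

In either case the frequencies sum to the sample size n.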

In multi-stage sampling, the g-subscript is replaced by subscripts that identify the sampling units at each stage. The ranges of summation of these subscripts extend over the set of sampling units contained in each of the sampling units selected at the previous stage. That is, selection frequencies are assigned and samples are selected at each stage of sampling independently within the previous stage. If stratification has been imposed on the sampling frame, the ranges of summation extend over the set of sampling units contained in a stratum. That is, the selection frequencies are independently assigned and samples are independently selected within each stratum.

The size measure is (ideally) proportional to the value of the response variable associated with the unit, if information for the purpose is available, or can be set equal to one for all values of the relevant subscripts (equal probability sampling). Size measures can also be computed to simultaneously achieve specified sampling frequencies for multiple domains (1), if information for the purpose is available.

Similarly, in multi-stage, stratified designs the sample sizes (n-values) are determined for each stage of sampling within each of the design strata. The following section describes a procedure for determining sample sizes to satisfy arbitrary variance constraints for the least cost.

The Kuhn-Tucker Conditions

The sampling variances can be expressed as a function, $\mathrm{Var}(\mathbf{n})_d$, of a vector of sample sizes, $\mathbf{n}$, selected from within each of the design strata at each stage of sampling. The variable cost of the field study can be expressed as a function, $C(\mathbf{n})$, of the same sample sizes. The sample allocation problem can then be stated in terms of minimizing the cost function, $C(\mathbf{n})$, subject to the inequality variance constraints given by,

$$\mathrm{Var}(\mathbf{n})_d \le K_d .$$

The values $K_d$ are chosen by the investigator. The solutions sought, denoted by ${}^*\mathbf{n}$, are the sample sizes that minimize the objective function,

$$O(\mathbf{n}, \boldsymbol{\lambda}) = C(\mathbf{n}) + \sum_d \lambda_d \left[ \mathrm{Var}(\mathbf{n})_d - K_d \right] , \qquad (1)$$

where $\lambda_d$ is the Lagrange multiplier associated with the variance constraint imposed on the estimated size of the d-th domain. Taking derivatives of the objective function with respect to the vector of sample sizes and equating to zero yields (gradient) equations of the form,

$$\frac{\partial C(\mathbf{n})}{\partial \mathbf{n}} + \sum_d \lambda_d \frac{\partial \mathrm{Var}(\mathbf{n})_d}{\partial \mathbf{n}} = \mathbf{0} . \qquad (2)$$

If the variance constraints hold, then at ${}^*\mathbf{n}$ there must exist values of the Lagrange multipliers, ${}^*\lambda_d$, such that equation 2 evaluated at ${}^*\mathbf{n}$ is true and, additionally,

$$\mathrm{Var}({}^*\mathbf{n})_d \le K_d , \qquad (3)$$

$${}^*\lambda_d \ge 0 , \qquad (4)$$

$${}^*\lambda_d \left[ \mathrm{Var}({}^*\mathbf{n})_d - K_d \right] = 0 . \qquad (5)$$
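For the simplest designs the conditions can be solved in closed form. The sketch below is an illustration under assumptions that are not taken from the National Pesticide Survey: a single-stage stratified design, a linear cost C(n) = Σ c_h n_h, one variance constraint of the form Σ A_h / n_h ≤ K, an objective of the form C(n) + λ[Var(n) − K], and sample sizes treated as continuous. All names and numbers are hypothetical.

```python
import math


def allocate_single_constraint(A, c, K):
    """Minimize sum(c_h * n_h) subject to sum(A_h / n_h) <= K.

    Setting c_h - lam * A_h / n_h**2 = 0 (the gradient equation for this
    model) gives n_h = sqrt(lam * A_h / c_h); requiring the constraint to
    hold with equality then fixes the multiplier lam.
    """
    root_lam = sum(math.sqrt(a_h * c_h) for a_h, c_h in zip(A, c)) / K
    n = [root_lam * math.sqrt(a_h / c_h) for a_h, c_h in zip(A, c)]
    return n, root_lam ** 2


def kuhn_tucker_residuals(A, c, K, n, lam):
    """Return (gradient residuals, constraint slack, complementary slackness)."""
    var = sum(a_h / n_h for a_h, n_h in zip(A, n))
    grad = [c_h - lam * a_h / n_h ** 2 for a_h, c_h, n_h in zip(A, c, n)]
    return grad, K - var, lam * (var - K)


# Hypothetical two-stratum problem.
A = [4.0, 9.0]   # variance components per stratum
c = [1.0, 4.0]   # cost per sampled unit in each stratum
K = 0.5          # maximum allowed variance

n_star, lam_star = allocate_single_constraint(A, c, K)
grad, slack, comp = kuhn_tucker_residuals(A, c, K, n_star, lam_star)
cost = sum(c_h * n_h for c_h, n_h in zip(c, n_star))
```

For these inputs the allocation is 32 and 24 units at a cost of 128, with a multiplier of 256; the gradient residuals, the constraint slack, and the complementary slackness product are all zero, so the gradient equation and the three additional conditions hold.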

Equations 2 through 5 are the Kuhn-Tucker necessary conditions (see, for example, (2), pages 186 and 192). A general exposition of the application of Kuhn-Tucker theory to the problem of determining the minimum cost allocation of samples subject to multiple variance constraints is presented in (3).

For all but the simplest of sampling designs, the allocation solutions are found using iterative numerical procedures. If, in the iterative procedure, the initial values of the Lagrange multipliers, denoted by ${}^\circ\lambda_d$, are computed to equal the values that individually satisfy the variance constraints, then a comparison of the initial and final values will identify the relative importance of each constraint in determining the allocation solutions. Superfluous constraints, that is, those coincidentally satisfied with the imposition of other constraints, will have final Lagrange multiplier values

$${}^*\lambda_d = 0 .$$

The most important constraints will have final values close to their initial values, ${}^\circ\lambda_d$.
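The comparison of initial and final multipliers can be sketched for a deliberately simple case. The example assumes a single-stage stratified design with linear cost and two constraints of the form Σ A_dh / n_h ≤ K_d, and it handles only the situation in which one constraint's stand-alone solution already satisfies the other; the general case needs an iterative solver, which is not shown. All names and numbers are hypothetical.

```python
import math


def solve_alone(A, c, K):
    """Closed-form allocation and multiplier for one constraint on its own."""
    root_lam = sum(math.sqrt(a * ch) for a, ch in zip(A, c)) / K
    n = [root_lam * math.sqrt(a / ch) for a, ch in zip(A, c)]
    return n, root_lam ** 2


def variance(A, n):
    return sum(a / nh for a, nh in zip(A, n))


c = [1.0, 4.0]                          # cost per unit in each stratum
A_by_domain = [[4.0, 9.0], [1.0, 1.0]]  # variance components, domains 1 and 2
K = [0.5, 0.5]                          # variance limits, domains 1 and 2

# Initial multipliers: each constraint satisfied individually.
initial = [solve_alone(A, c, k)[1] for A, k in zip(A_by_domain, K)]

# If one stand-alone solution satisfies every constraint, it meets the
# Kuhn-Tucker conditions with all other multipliers set to zero.
final, n_star = None, None
for d, (A, k) in enumerate(zip(A_by_domain, K)):
    n, lam = solve_alone(A, c, k)
    if all(variance(A2, n) <= k2 + 1e-12 for A2, k2 in zip(A_by_domain, K)):
        final = [lam if e == d else 0.0 for e in range(len(K))]
        n_star = n
        break

ratios = [f / i for f, i in zip(final, initial)]
```

Here the first constraint drives the allocation (its final multiplier equals its initial value, ratio 1), while the second is superfluous (final multiplier 0).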

The final values that most closely approach the initial values identify the variance constraints that are driving the field study costs. A small relaxation in the identified constraints can produce sizeable cost reductions.

An Example

The rural well component of the EPA's National Pesticide Survey (NPS) provides an example. The NPS design, data collection procedures and pilot implementation are described in Mason, R. E., et al., Research Triangle Institute report number RTI/7801/06-02F, 1988, unpublished. A summary of the relevant sampling design information for present purposes is as follows.

Sampling Design. The sample was selected in three stages. A sample of counties was selected at the first stage. The county frame was stratified in two dimensions. The first dimension identified counties with quantifiably high, moderate, low and uncommon agricultural use of pesticides based on the use in 1982 of 63 targeted chemicals on 29 targeted crops. The second dimension identified those counties within use strata having the highest, intermediate and lowest potential for ground water contamination based on the distribution of county level DRASTIC scores (Alexander, W. J., et al., Research Triangle Institute unnumbered report, 1985, unpublished) over those counties in the same use stratum. First-stage strata are denoted by the subscript $a = 1, 2, \ldots, 12$.

Second-stage sampling units were non-overlapping land area segments that, in the aggregate, accounted for the total rural land area in each sample county. The segments were constructed of a size convenient for counting and listing all domestic wells contained in a segment. The second-stage frame was stratified to identify those sub-county areas most vulnerable to ground water contamination and having the highest agricultural crop production. Second-stage strata are denoted by the subscript $b = 1, 2$.

Third-stage sampling units were operable domestic wells. The number of wells in the b-th second-stage stratum and a-th first-stage stratum is denoted by $N_{ab}$, and the number of wells in the a-th first-stage stratum by,

$$N_{a+} = \sum_{b=1}^{2} N_{ab} .$$

The values shown in Table I, the numbers of households with wells, were used as surrogates for the values $N_{a+}$ and $N_{ab}$. Table II identifies the domains, the domain sizes, and associated precision requirements that form the basis of the design. The precision requirements were stated in terms of the relative standard errors to be associated with sample estimates of the specified domain sizes. The detection probabilities and approximate confidence intervals shown in the table were computed from the standard errors.

The first specification in Table II, for example, says that the relative standard error to be associated with a sample estimate of any domain of wells that comprises one percent or more of all wells nationally is not to be greater than 100 percent of the domain size. Equivalently, the survey is required to have at least a 63 percent chance of detecting any domain of wells that comprises one percent or more of the total, or, that the confidence interval about the sample estimate of a domain of this size have the limits indicated in the table. The specifications for the remaining domains have a similar interpretation, except that one percent of the wells in strata 1, 2, and 3 (domain 2 in Table II) translates into 0.14 percent of wells nationally (and so on for domains 3, 4 and 5).

Other interpretations of the precision requirements shown in the table and, indeed, other equivalent specifications can be developed. The essential point of the exercise in developing the table is to provide,

• with pre-specified reliability,
• estimates of parameter values that have policy and program importance,
• within the resources available for the study.
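The text says only that the detection probabilities were computed from the standard errors; the formula itself is not given. One relation that reproduces the tabulated values is sketched below. It rests on an assumption that is not from the source: for a rare domain, the expected number of domain members in the (effective) sample is roughly 1/RSE², and the count is approximately Poisson, so the chance of observing at least one member is 1 − exp(−1/RSE²).

```python
import math


def detection_probability(rse: float) -> float:
    """Assumed relation: P(at least one domain member) = 1 - exp(-1 / rse**2)."""
    return 1.0 - math.exp(-1.0 / rse ** 2)


# Domain number -> (relative standard error, tabulated detection probability),
# as listed in Table II.
table_ii = {
    1: (1.0, 0.63),
    2: (0.85, 0.75),
    3: (0.85, 0.75),
    4: (0.525, 0.97),
    5: (1.25, 0.47),
}

computed = {d: round(detection_probability(rse), 2) for d, (rse, _) in table_ii.items()}
matches = all(abs(computed[d] - tab) < 1e-9 for d, (_, tab) in table_ii.items())
```

Under this assumption a relative standard error of 1.0 corresponds to 1 − exp(−1), about 0.63, the 63 percent chance quoted for the first specification.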

Variance Model. If $P_{dab}$ denotes the relative size of the d-th domain in the b-th second-stage and the a-th first-stage stratum, then the parameters of interest are given by,

$$P_d = \sum_{a=1}^{12} \frac{N_{a+}}{N_{++}} \sum_{b=1}^{2} \frac{N_{ab}}{N_{a+}} P_{dab} ,$$


Table I. Stratum Sizes: Households With Wells (thousands)

| First-stage stratum (a) | Stratum total, $N_{a+}$ | b = 1: most heavily cropped and vulnerable 25 percent, $N_{a1}$ | b = 2: remaining areas, $N_{a2}$ |
|---|---|---|---|
| 1. High average use, high average vulnerability | 455 | 114 | 341 |
| 2. High average use, moderate average vulnerability | 916 | 229 | 687 |
| 3. High average use, low average vulnerability | 440 | 110 | 330 |
| 4. Moderate average use, high average vulnerability | 684 | 171 | 513 |
| 5. Moderate average use, moderate average vulnerability | 1,417 | 354 | 1,063 |
| 6. Moderate average use, low average vulnerability | 671 | 168 | 503 |
| 7. Low average use, high average vulnerability | 1,154 | 289 | 866 |
| 8. Low average use, moderate average vulnerability | 2,270 | 568 | 1,702 |
| 9. Low average use, low average vulnerability | 1,170 | 293 | 878 |
| 10. Uncommon average use, high average vulnerability | 1,043 | 261 | 782 |
| 11. Uncommon average use, moderate average vulnerability | 1,894 | 474 | 1,421 |
| 12. Uncommon average use, low average vulnerability | 997 | 249 | 748 |


Table II. Precision Requirements

| Domain description (d) | Relative domain size | Relative standard error | Detection probability | Confidence interval |
|---|---|---|---|---|
| 1. All wells nationally | 0.01 | 1.0 | 0.63 | 0.0 - 0.30 |
| 2. Wells in counties with highest average use (a = 1, 2, 3) | 0.0014 | 0.85 | 0.75 | 0.0 - 0.004 |
| 3. Wells in counties with highest average vulnerability (a = 1, 4, 7, 10) | 0.0025 | 0.85 | 0.75 | 0.0 - 0.007 |
| 4. Wells in the cropped and vulnerable parts of counties (b = 1) | 0.0025 | 0.525 | 0.97 | 0.0 - 0.005 |
| 5. Wells in counties with highest average use and vulnerability (a = 1) | 0.0003 | 1.25 | 0.47 | 0.0 - 0.011 |

where,

$$N_{++} = \sum_{a=1}^{12} \sum_{b=1}^{2} N_{ab} .$$
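The relative domain sizes in Table II can be checked against the stratum sizes in Table I. The sketch below evaluates the weighted sum of stratum-level proportions, with weights built from the Table I counts, for the case where a domain makes up one percent of the wells in the listed strata and none elsewhere. The helper names are hypothetical, and the stratum totals are recomputed from the second-stage values, so they can differ by one from the rounded totals printed in Table I.

```python
# Table I: households with wells (thousands), (b = 1, b = 2) per stratum a.
N_ab = {
    1: (114, 341), 2: (229, 687), 3: (110, 330), 4: (171, 513),
    5: (354, 1063), 6: (168, 503), 7: (289, 866), 8: (568, 1702),
    9: (293, 878), 10: (261, 782), 11: (474, 1421), 12: (249, 748),
}

N_a_plus = {a: sum(cells) for a, cells in N_ab.items()}  # first-stage totals
N_plus_plus = sum(N_a_plus.values())                      # national total


def national_domain_size(P_dab):
    """Sum over a of (N_a+/N_++) times sum over b of (N_ab/N_a+) * P_dab."""
    total = 0.0
    for a, cells in N_ab.items():
        inner = sum(
            (n_ab / N_a_plus[a]) * P_dab.get((a, b), 0.0)
            for b, n_ab in enumerate(cells, start=1)
        )
        total += (N_a_plus[a] / N_plus_plus) * inner
    return total


def one_percent_in(strata_a, strata_b=(1, 2)):
    """Hypothetical helper: P_dab = 0.01 in the listed cells, 0 elsewhere."""
    return {(a, b): 0.01 for a in strata_a for b in strata_b}


all_a = range(1, 13)
sizes = {
    1: national_domain_size(one_percent_in(all_a)),
    2: national_domain_size(one_percent_in((1, 2, 3))),
    3: national_domain_size(one_percent_in((1, 4, 7, 10))),
    4: national_domain_size(one_percent_in(all_a, strata_b=(1,))),
    5: national_domain_size(one_percent_in((1,))),
}
rounded = {d: round(p, 4) for d, p in sizes.items()}
```

Rounded to four decimal places, the results agree with the relative domain sizes listed in Table II, including the 0.14 percent figure quoted for domain 2.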

The sampling variance, $\mathrm{Var}(\mathbf{n})_d$, is made up of three components, one for each stage of sampling, divided by the (to be determined) sample sizes selected at each stage. Notationally,

where,

$n_{1a}$ = the number of sample counties (to be) selected from the a-th first-stage stratum,

$n_{2ab}$ = the number of sample sub-county segments (to be) selected from within the b-th second-stage stratum constructed within each of the sample counties,

$n_{3ab}$ = the number of sample wells (to be) selected from within each sub-county segment classified into the b-th second-stage stratum and a-th first-stage stratum.

The variance components themselves are functions of population variances and (intracluster) correlations. The correlations, denoted by $R_{1d}$ and $R_{2d}$, arise respectively because of,