Signal Processing Techniques for Estimating Parameters of Certain Biological Models Rex A. Naden and J. V. Leeds Departments of Electrical Engineering and Environmental Science and Engineering, Rice University, Houston, Tex. 77001
An often encountered biological system is one whose input is a continuous function of time and whose output is a discrete stochastic process. The output is processed by a recording system which reports, with time period T, the number of events occurring in the prior time interval. Any model of the biological system should contain a model of the data processing section as well. Methods of reconstruction of the discrete data generated by such a system are discussed. These techniques are valuable in that more accurate modeling can often be accomplished by fitting to a continuous function rather than to discrete data points. The above stochastic system is often modeled by a deterministic system; the accuracy of this approximation is discussed quantitatively, in addition to predicting reconstruction error estimates. Combination of the reconstruction error estimates and the stochastic error estimates produces a relation between the maximum total relative error and the sampling period T. These results can be used to estimate total accuracy of the model if the sampling period is fixed, or to plan the best sampling scheme if the sampling time can be externally controlled.
M
any biological systems exist in which the input (or excitation) to the system is a continuous function of time c(t) that can be measured, and the resulting output of the system is a discrete stochastic process. In many recording systems, the output of the biological system is processed to give the number of events occurring in a fixed time interval. Often, the processing and recording of this output are predetermined and cannot be altered by the investigator ; he has access to tabulated data only. The recording system is often a counter which generates reports X o , . , , , X, at times t o , .. . , tn with regular spacing T; a typical report Xk represents the number of events occurring in the immediately prior time period (tk+tk). In certain circumstances the following deterministic model adequately approximates the biological system with a continuous input and a discrete stochastic output. This model generates an analog output p(z) in response to the stimulus c(t). The function p ( t ) is processed by a running-time averager with parameter T to simulate the counting process of the actual system. The output of the running averager, P(t), is sampled at the times to,. . , , t, to produce the numbers Ij(tk) = Ykfor k = 0,. . ., n. These samples are analogous to the physical data Xn.The model data, Y k , can then be compared with the real data, X k , to determine the adequacy of the deterministic model. In order to produce a more accurate model, P(t) should be compared with an analogous signal also time averaged with parameter T-Le., ?(t) from the actual system, for all t rather than just at the sampling instants. Unfortunately, this is impossible to do directly in a system such as the one under study. In addition, even if ?(t) were available, ?(t) is a stochastic signal and therefore unsuitable for direct comparison with p(t). However, a deterministic approximation to ?(t)522 Environmental Science & Technology
namely ?,(t)-can be generated by known signal reconstruction techniques operating on the data set { Xk]and this approximation P,(t) compared with the deterministic signal Ij(t). It is also possible to posit a deterministic model that does not contain a running averager. Let the output of such a model be q(t) for an input c(t). This model can be fitted to the data { X L ] at the sample times or at all times by comparing g(t) with the deterministic approximation ?,(t). However, it can be seen that the model and its parameters depend implicitly on the sampling time T. Hence if T changes, the model changes. A significant part, and a known part, of the system is omitted when the running averager is eliminated. Although a fit to the data can be obtained without the running averager, the model which includes the running averager is a more accurate representation of the actual process. The above discussion suggests the following questions which will be answered in this paper: What is the effect of the reporting system in determining the model? What reconstruction techniques are useful for finding ?,(t) from the data set { Xk)? What are the reconstruction error bounds for these methods? When is a deterministic model good enough, in some sense, to describe the original stochastic system ? Practical application of the theory will be demonstrated as the theory is presented. The system chosen as an example is the well-known air pollution disaster of London in December, 1952 (Goldsmith, 1968) and the London hospital reporting system. The model studied was given in the pioneering work of Friedlander and Ravimohan (1968). Current State of the Air Pollution and Health Enigma
Most modern cause-and-effect studies of air pollution (Martin and Bradley, 1960; Hechter and Goldsmith, 1961 ; Martin, 1964; Sterling, et al., 1966; Sterling, et al., 1967; McCarroll, 1967; Burrows, et al., 1968; Verma, et al., 1969) use statistical tools such as correlation and regression analysis since the presence or absence of the “signal” (morbidity or mortality) is severely affected by the “noise” (other environmental factors). The philosophy of several studies has lately come under criticism (Cassell, et al., 1968; Weaver, 1969), because the “noise” level is so high compared to the “signal” level. An excellent review of the philosophy of applying statistical methods to these problems has been given (Ipsen, et al., 1968). These models do not in general consider the dynamics of the pollution episode, but rather some measure of the pollution episode such as peak value, integrated dose, etc. As sophisticated lung deposition theory becomes available, accompanied by advances in environmental and health data aquisition and handling, combined stochastic-deterministic dynamic models for reliable health prediction will become available. These models will utilize not one, but several environmental signals as well as many health indicators. Construction of these models will rely to a great extent on modern signal processing theory. Even then, a deterministic approximation to the stochastic portion of the model such as in the
L h 1
q lI
?EST COMPARISON ?UT NOT POSSIBLE
EXCITATION
RECONSTRUCTION
I
Figure 1. System abstraction
NODEL: ALGEBRAIC AND D I F F E R E N T I A L EQ’NS
t I I
t I
I
II
J
I
PARAME 1 E R
I D E N T I F I C A T I O N PATHS
Friedlander-Ravimohan model will be valuable and reconstruction of signals will be necessary. The model used below is for illustration only and is not intended to be the ultimate model.
1 I 1 1 1 1 1 1 1111 I, 1 1 1 1 ,
1 1 1 1, I 111,lIII 1 1 , 1 1 1 ,
IO
5
I5
20
25
t,DAYS
15
20
25
1-
r t )-TOTAL NUMBER
Sj*steinAbstractiori and Anaij)sis
Figure 1 presents an abstraction of the system described in the introduction. The unending sequence of events that forms the actual system output can be characterized mathematically by dt
=
5
8(t - t,)
1
m = - m
5 IO b ! X \ - S A M P L E S O F Ti’!
where the delta function s(t) is defined by the integral relation f(t) =
J-m
J (0)6(0 - t)d0 for a n arbitrary function J(0). In-
tegration of the impulses [dr(t)]/dtleads t o r(t). The resulting integral is delayed by a time T , then subtracted from the current value of the integral and finally this result is divided by T . This complete mathematical operation is represented by the term “running averager.” The signal ?(t) is then simply the sum of the output events occurring in the time interval [ t - T J ] divided by T . Action of the sampler results in output of the data set { X , ) . See Figure 2 for a graphical presentation of typical operations. The analogous model utilizes a set of algebraic and differential equations to predict y(r) which is interpreted as the instantaneous rate of occurrence of the events of the original system. The running averager performs the same function in the model as in the real system. The sampler with parameter T produces the model data { Y , ) . The running averager smooths and delays the wave shape of p(t). In a similar sense, the recording system affects data of the physical system. As will be seen later, when the recording system-Le., the running averager-is incorporated in the model, the parameters defining the model may differ significantly. The function ?(r) is a random variable which is a representation of another function of time h(r). The function X(t) is a parameter of the stochastic process. In a sense, ?(t) is an estimate of X(t), or conversely X(t) is a “smoothed” version of F(t) (Figure 3). The reporting system only provides samples {A’,)of ?(t). The function X ( r ) must be estimated from these samples. To determine the errors in reconstructing the function ?(t) from the samples X,}, worst-case derivative bounds are needed on ?(t). The function ?(t) is stepwise continuous and thus contains infinite derivatives. Yet, what is desired is a deterministic approximation to a stochastic process. Hence, ?(t)is assumed to be smooth enough that derivatives of
h
T
4
Figure 2. Demonstration of the operation of a typical data reporting system
I 4
4 h. hit’ \
I
‘6
1
‘k
‘k+l
I
oms
Figure 3. Arbitrary section of time data taken from a system such as that in Figure 1 p ( r ) exist and furthermore can be estimated from the model output, p(t). To obtain worst-case reconstruction error bounds, a worst-case excitation signal c,,(t) must be used. The function c,,(t) is the signal that, in the opinion of the modeler, represents the input causing the highest value of the derivative that would be encountered in the function P(t). The resulting p(t) is called puc(t).The function j U c ( tis ) then analyzed to obtain the amount of error as a function of time. The plan of attack to determine the algebraic and differential equations of the model begins with simply assuming a set of such equations with the known input c ( t ) . The values of the unknown parameters Kl, , K, in these equations are optimized by any of several available identification techniques, Volume 5 , Number 6, June 1971 523
717
144
-
pit)
Figure 4. Left: Approximate concentration of SO2 (London, 1952). Right: London death reports (ages 65 to 74)
such as the straightforward gradient method (Fleischer, 1966; Wilde and Beightler, 1967), so that a least-squares pern
formance index
Figure 5. Friedlander-Ravimohan model A : deposition phase; B, clearance phase. Unknown parameters: KI, Kz, K3
- -
Po
( X L - Yk)2is minimized. This completes
k=O
the first stage of the modeling. But now knowledge about the form of the equations and the input is used t o reconstruct the sampled signal, This forms the second stage in the modeling process. The reconstruction method with the least error is chosen and p,(t), the approximation to f ( t ) , is generated from the samples { Xk}. Model optimization is again conducted, this time using an integral performance index such as J-mm
K,CL(t)
'rai
tj-I
Mi) - ?4t)1%
t ,6AYS
tj
Figure 6. Zero- and first-order hold data reconstruction
thus considering a particular measure of the effect of error n
over the entire time of interest. [The product
T.
(Xk- Y J 2is an approximation to the integral error
:s
( p ( t ) - 7,&) 1 Wt. T o decide between the error measures, one must determine whether the integral is adequately approximated by the sum. If it is, minimization of either is equivalent. If not, then minimization of the discrete index will simultaneously (a) minimize the error between the two functions, and (6) minimize the error caused by approximating the integral by a sum. Thus, the coefficients obtained by the discrete method will be influenced incorrectly by (b).] The limits of (- m , m ) are used to consider all time. The functions p ( t ) and ?.(t) are assumed zero outside the time interval of interest. The model arising from these steps is the best model possible with the known data { X,)using the described performance indexes. If the modeler is not satisfied with the value of the performance index, then during either stage the equations themselves may be altered and new coefficients found. To illustrate the application of the above techniques data (Figure 4) from the London fog episode of 1952 (Scott, 1953) and Friedlander's model of that episode are used. It is assumed the c(t) observed is representative of the entire city of London. Friedlander's model predicts departures from the mean average instantaneous death rate given the pollution concentration as a function of time. The model equations are given in Figure 5 (Tis seven days). Iteration on the unknown parameters with the performance 5
index
(X,- Y k ) 2produced the parameter values Kl = 5.0,
values Kl = 5.0, Kz = 0.1307, and K I = 0.0565. Parameter Kz is the rate constant for clearance of the pollution from the lungs. Hence, t.he difference of approximately 2 7 z in the parameter K2 may be significant. Reconstruction Methods and Error Estiinntion
Since the sampled data mentioned in this paper are considered exact, it is reasonable to require that the reconstructed signal agree with the data at the sampling times. For notational ease let the range of approximation be A = { t : to 5 t 5 tnii.e., a region inclusive of the sample times, The n 1 samples beginning with X o at to are uniformly spaced and sequentially numbered. The time dependent errors e i ( t ) are defined by the equation ei(t) = p u c ( t ) - fat(r), where rat is the reconstruction corresponding to technique /. Techniques considered are zero-order hold, first-order hold, sampling function interpolation (ideal low-pass filtering of the data), and cubic spline interpolation. Zero- and First-Order Hold Approximation. These wellknown stepwise polynomial interpolating approximants are shown in Figure 6 for a typical set of samples. Errors for these approximations are available (Hildebrand, 1956). These errors for any teA and sampling interval T are bounded by
+
teA 1
and el(t) 5
T2 'd2p,,' - max -8 1 dt2
A=O
K2 = 0.1570, K 3 = 0.0614 (Figure 5). Friedlander's model, which did not consider the recording system, produced the constants Kl = 5.0, K2 = 0.1800, and KB = 0.0547. Further refining the model by using the integral perfor-
mance index
Lmrn
[p(t) - r,,(t)12dt, where ?.,(t) is the sampling
function reconstruction as discussed below, resulted in the 524 Environmental Science 81Technology
Practical estimation of derivative bounds required by Equations 1 and 2 can be accomplished by (a) evaluation of differencing formulas, and (6) Fourier (frequency domain) analysis. Differencing formulas are applied to the set { pwc(tk)] which yields these estimates for any t* belonging to an interval : -i.e., t* e(
(3)
Substitution of the Fourier analysis estimates into Equations l and 2 yields, respectively,
I eo(t)l 5 129.5T = 910 deaths/week
and d2P,,
dt2;
+ Pue(tn-1)
- 21Sdtk)
Puc(lk+l)
(t*) = ~
_
_
_
2T 2
and (4)
Fourier analysis (Papoulis, 1962) of &(t) requires the reasonable assumption that the function is square integrable. Then its Fourier transform P,,(w) exists and is defined by the following, where j =
dy:
Pdw)
=
s-mc5
P d r ) exp (-iwt)dt
(5)
Equation 5 and its inverse can be represented by the notation Fvc(r) tf P,,(w). The inverse relation of Equation 5 implies, after differentiation,
Using Equation 6, bounds on.
1 el(t)I
5
106.9 8 T 2 = 657 deaths/week
ld2p,,~ max dt2 = 25.1 deaths/week3 teA
~
~’$~57.9 deathsjweek2 and =
‘dZPwdt)’ 1 7 ~ = 7.14 deaths/week3 (9)
7;
(14)
The application of Fourier analysis to pWc(t)results in IP,,c(w)I. A plot of lP,,(w)l as well as other quantities of interest is shown in Figure 7. The complex Fourier transform Puc(w)was obtained by use of a discrete fast Fourier transform routine (IBM, 1968); the computer time required on a Burroughs 5500 was 14.2 sec for a 1024-point transform pair. Actual derivative bounds for the signal Fwc(t)obtained when solving the model for P,,(t) are
d&, d2Puc and can be obtained dt dt
The signal &(r) is obtained from the model as described above using the worst-case signal c,,(t). The worst-case signal c,,(t) is obtained simply by using Friedlander’s approximation to c ( t ) as a worst-case input. This is justified because Friedlander’s approximation in the case of the London data is actually a first-order hold approximation to the measured pollution pulse (British Ministry of Health, Committee of Departmental Offices, 1954). This type of reconstruction produces a wave form with sharper corners than could be expected in the actual SO?concentration as a function of time [and thus higher derivative estimates for PtiC(t)]. Application of the difyerencing formulae in Equations 3 and 4 to the London data given in Figure 4 generates the following estimates : max teA
(13)
~
Thus the differencing estimates are low, while Fourier analysis methods yield high estimates of the derivatives. This can be better explained by considering the curve &(t) in Figure 8. Data sampled every seven days from this curve will not reveal the relatively high curvatures apparent near t = 5 and t = 12, resulting in low derivatives as estimated by the divided difference formulas. The Fourier methods give high estimates because of the error introduced in forming the inequality (Equation 7) from the equality (Equation 6). where the elW‘is replaced by unity. Comparisons of Equations 10-12 indicate the large error possible when using zero- or first-order hold reconstruction and a large sampling time (seven days). The effect of changing the sampling interval will be considered further below. Figure 8 compares these reconstructions from the data { X L ) with &(f). The final worst-case errors which include model and reconstruction errors are 1 e o /ma,. =: 397.3 deaths per week and I ell m a y = 113.7 deaths per week, which are relatively large values. Bandlimited Signals and Sampling Function Reconstruction. Reconstruction of an arbitrary real bandlimited signal f ( t ) containing no frequencies larger than 10, from its samples can be accomplished without error (Shannon, 1949) using the expansion
I
Using Equation 9 in Equations 1 and 2, respectively, gives
1 eo(t)l T- . In this case,
“aliasing” or “folding” error occurs in
the reconstruction. This error will be low, however, if the = 106.9 deaths/week (12)
,z~
fraction of energy in Pwc(w)above Q, 4
a -
T
is not large com-
pared t o the total energy-Le., Volume 5, Number 6, June 1971 525
A
lQa 1 1
Pwc(w)2dw
Q(W 4 Lg
-
1 pZl.,(~)l2do
= I
(17)
3.0
2c-
,;w
, 04
02
.C8
06
12
.IO
2r
14
Actual determination of a bound on aliasing error is relatively easy if the Fourier transform is available (Papoulis, 1966). Suppose that ( r / T ) = 0, < U p ; then a n estimate of pwe(t)is given by
B
$/ t
and the error
2.5
If PwC(w) exists, then R,,(w) and Edw) exist, where 1.0 0.5
.06
.04
.02
08
.I2
.IO
and
.I4
R,,(w)
- P d w ) = E4w) tf e&)
(21)
Figure 9 indicates a typical set of the spectra under consideration. It can be shown that
C 4
where P Rs
D
An upper bound on aliasing error can be written, since for any PEA, using Equations 22 and 23, gives
I 01
02
03 04 05
f,
I
0 6 07
Figure 7. A: Fourier transform of p,, (I). B: square of transform. C: plot of fractional energy contained in p,, (1). D: plot of J’. IP,, ( w ) l ; used in calculating first derivative of pue (I). E: plot of fz.IP,, (a)];used in calculating second derivative of p,,(t)
DATA
POINT
Xk
Actual calculation of B is complicated by the fact that E,(@) is created not cnly by contributions from the “tails” of immediately adjoining images of p , , ( ~ ) along the w-axis, but by a n infinite number of such additions because of frequency domain convolution. However, a saving feature is that PX,.,(w) is a square integrable function; thus the fraction of its energy beyond a given frequency must decrease as that frequency increases. This can be seen by examining a plot of Q(0,) given in Figure 7C. (The function Q is defined in Equation 17.) The fact that the value of Q(w) asymptotically approaches unity allows computation of B since contributions due t o frequency components above certain values can be ignored in practice. The phase relationship between p U c ( t ) and the sampling grid also affects the error bound B. Thus, location of the time origin in taking the Fourier transform P,,(w) influences the value of B to a certain extent. An exact bound o n this phase error is currently being investigated ; a n approximate bound on this error based on empirical observations will suffice for the purposes of this paper. I n the London fog example, B = 2a (119), and thus
I e21 5 119 deaths/week for T = 7
(25)
200
100 0
2
4
6
8
IO
I2
14
16
I8
20
22
24
26
28
30
Figure 8. Zero- and first-order reconstruction 526 Environmental Science & Technology
Figure 10 compares the curves of p,,(t) and ra,(t) for the example problem. The actual worst-case error was 1 e 2 / max = 122 deaths per week. The discrepancy between the stated error “bound” and the actual error committed is due in part
to the phase effect mentioned above and in part to the inability of the simple model being used to accurately predict the death rates encountered. Cubic Spline Function Interpolation. An often overlooked method of direct interpolation is that of cubic splines (Ahlberg et al., 1967), piecewise cubic interpolating functions that
agree in their first and second derivatives at each sampled data point. Spline functions avoid the marked undulatory behavior that commonly arises from fitting a single polynomial to a number of observations. In fact, cubic splines satisfy a minimum curvature property-Le., they minimize
1%'
dt among all functions g(t) which dt2 interpolate given data points and have continuous second derivatives. Under the assumption that r,,(t) is a cubic spline function interpolating the data points { Xh],the following conditions must hold at the sample times ( t k }:
the functional
A
dsk --_ -_ dt dt
and d2si, - - d2sk+l __ dt2 d2t
where s&) [ = ru3(t)in the interval (tk--l,tk)]is defined by
t
Sk(t) =
Mi-1 (tk 6T ~
-
t)3
Mk + -(t 6T
- tk_1)3
+
E2
(28)
Figure 9. Determination of the transform of the folding error
Figure 10. Sampling function reconstruction
A linear system of equations for finding the ( M k } with a tridiagonal coefficient matrix is found from the condition in Equation 28 at the n data points. Inversion of the tridiagonal matrix is easily accomplished in an iterative manner (Wilkinson, 1965). Computation of error bounds for the spline approximation is extremely difficult in practice, although the theory exists (Birkhoff and de Boor, 1964; Secrest, 1965). Empirical error computation is practical by digital computer. Figure 11 compares pue(t)with the spline approximation; actual worst-case error is 102 deaths per week. A summary of error bounds and actual errors encountered is presented in Table I, where the numbers presented resulted Table I. Summary of Errors for T el,
Theoretical error bounds
Zeroorder hold
Firstorder hold
Fourier analysis Differencing Actual errors
910 405 397
657 43.7 113
200 100
eo,
=
Seven Days ez,
e3,
119
. , .
Sampling Cubic function spline interinterpolation polation ...
...
122
102
Figure 11. Spline function reconstruction
ONLYENVELOPES OF THE DISCRETE P R O B A B I L I T Y ARE S H O W N
T = 7
X = I
Figure 12. Poisson probabilities for three values of XT. Note horizontal scale change 0.05
0 004
0
Volume 5, Number 6 , June 1971 527
from first fitting the model to the weekly reported data, then generating pwc(t)by inputing c,,(t) to this model. From Figure 8, it is clear that the function jjUc(t)does not agree with the actual data at every sampling point. Thus, actual errors reflect not only the reconstruction error, but also the error inherent in the model chosen, which in this case is only a firstorder differential equation.
0.70 0.75
I!
i
Approximation of Stochastic Models by Deterministic Models
-I
As indicated in Figure 1, a model is to be constructed such that the deterministic signal jj(t) “approximates” ?(t)as closely as possible. Since ?(t) is not available, the modeler must be satisfied with fitting p ( t ) to some Y a l ( r ) , a deterministic reconstruction of ?(t). However, ?(t) is a stochastic signal. Evaluation of the approximation to r(t) is the subject of the next section while this section furnishes the background theory. A typical set of wave forms for the biological and reporting system is shown in Figure 2. The function r(t) is the cumulative total of output events; the impulses drjdt represent the individual events, which occur a t random points in time. In addition P(t) = [r(t) - r(t - T)]/T. Obviously the “smoothness” of ?(t) depends on the average number of events per period T. A quantitative measure of this uncertainty or smoothness is possible by characterizing r(t) as a nonstationary Poisson process (Blanc-LaPierre and Fortet, 1964; Papoulis, 1965) with mean X(t):
i
Prob ( h events occur in period T ]
=
1
0
1
2
3
4
5 ‘ 6 T, D A Y S
7
8
9
1
0
Figure 13. Fractional error vs. T
--v) I
c
I
I
Y W
I
I
I
I
I
I ,I
I
160
-
140
a
120
Probability densities for three vsllues of X(t)T are shown in Figure 12. Assuming for the moment that X is constant, the error e ( T ) can be calculated by
{
[r(t
-
+ T ) - r(t)] - XT
-
AT
where e(T) is the maximum fractional deviation from the mean with a given probability u*. The value of e(T) is calculated by incrementing l in the summation XT+l
u = .if
= XT
-2
exp(-XT)---
(XT)’f M!
until u 5 u*. At this point, let 1 = I * ; then e(T) = (I*/XT). Two typical uncertainty “bands” with width corresponding to two values of e ( T ) are superimposed on ?(t) in Figure 2. As the probability becomes greater that the signal will fall between the two limits, the width of the band increases. However, the percentage variance decreases as the mean increases. For example, taking u* = 0.9 gives e = 0.50 for AT = 12, and e = 0.093 for XT = 313. When X = X(t) the same argument applies, with the bands equidistant from the mean. A worst-case estimate of e(T) for a nonstationary process is obtained by taking the lowest expected value of X, designated Amin. Uncertainty Principle as Applied to Stochastic Process. The actual goal in the modeling process is to fit p(t) to X(t). However, Figure 3 shows that a typical data point X k does not, in general, coincide with X(t). In attempting to reconstruct X ( t ) from the { X k ) noisy data must be used. How noisy these data are depends on the product X(t)T. It would be inconsistent to use a highly accurate reconstruction technique in a case 528 Environmental Science & Technology
0
1
2
3
4
5
6
7
8
9
1
0
T , DAYS
Figure 14. Error bounds for four reconstruction methods as a function of the sampling interval
IO
I
I
I
I
I
I
I
I
I
9
-
8
-
7
m
>
6
a 0
5
k
4
FUNCTION RECON-
-
3
2
0
5
IO
15
20
e ( T ) ANDTe,JT)
25
,
30 35 40 DEATHS
45
50
Figure 15. Comparison of reconstruction and statistical errors for various values of Xmin .P* = 0.95
8
.7 6
5
v,
> a
4
n -
3
F
2 I
0 % ERROR
Figure 16. Sampling error given as a percentage of the expected deaths in the given time period
7 6 5 (0
2
0
4
E T*, the data are statistically more dependable but the reconstruction will become more inaccurate. Since the total error is bounded by the sum e(T) ebl(T), I = 0,1,2,3 [where ebz(T) are the various reconstruction error bounds], sampling policies other than choosing T such that e ( T ) = eal(T)may result in a lower total error. This will be discussed in the next section. Sampling policies for systems with other values of Amin can be planned using Figure 15, which plots e ( T ) vs. T for a fixed p* = 0.95 and various Amin values. (Note that the errors in this graph are given in number of people, rather than in percentage as in Figure 13.) Also included for comparison is the error curve for sampling reconstruction. The Amin values used correspond to the death rate averages for various age groups (Scott, 1953) before the London disaster. This shows dramatically how the choice of the sampling period is affected by the value of A m i n .
c b , (2)
is used as the
The sampling error bound curve in Figure 15, given in number of deaths, was converted to a percentage curve by dividing the ordinate by X,i,T and the result plotted as Figure 16. Addition of abscissae from Figure 13 (u* = 0.95) and Figure 16 for given T values allows a plot (the “unfiltered” curve of Figure 17) to be made of the sampling time as a function of error. The total error is expressed as a fraction of the number of deaths expected in the given sampling time. As can be seen in Figure 17, there are in general two solutions of sampling time for a fixed error. This clearly expressed the uncertainty principle. The curve marked “worst-case” in Figure 17 represents a worst-case result in terms of reconstruction error because the worst-case input c,,(t) was chosen. I n an attempt to evaluate the severity of this choice o n the result, a “better” case was posited. The input pulse c(t) shown in Figure 4 (left) was passed through a n ideal low-pass filter with cut-off frequency of 0.133 cycle per day and then applied on the model (Figure 18). The time responses of the model to this input, pbe(t)and jjbc(t), are shown in Figures 19 and 20, respectively. The ideal lowVolume 5, Number 6, June 1971 529
pass filter causes the signals C,,(t) and pb&) to have oscillations about zero. Both effects cannot exist in the real world. The low-pass filter was chosen for simplicity in numerical computations. The “better” case error curve of Figure 17 resulted from analysis of these signals. Thus, even under these “better” conditions, the minimum error exceeded 20 which occurs for a three day sampling period (approximately). From these investigations, one concludes that the model results have at least a 20 error and with a seven day sampling period-Le., weekly-the error could be as high as 60 Total error curves for three values of A,!, and with the worst-case input pulse are shown in Figure 21. The minimum error for A m i n = 313, the death rate of the 75-year-old and above age group, was about 2 3 z at T = 2.5 days. The 45 to 54-year age group average death rate was A m i n = 75; minimum error possible with this model was about 4 2 x at T = 3.5 days. The error for seven day sampling was a minimum of 6 0 z . Hence, a significant error reduction results when the “best” sampling period is chosen.
1
W Y
Y
z
z.
I Figure 20. Running average of Pbc
(r)
7 6
5 v,
Conclusions and Projections A stochastic system has been characterized in which intermediate signals are not available. Only sampled data at the output can be used in the investigation. Often this system is modeled with a deterministic model because cause-and-effect relationship is clearer with this simplification. The actual data collection device should be included in the model. Otherwise, the model is incorrect and the parameters of the model will change when the parameters in the reporting system change. A demonstration of the magnitude of this effect on the model parameters shows its importance. In an example of a model of the London fog disaster and its concomitant death rate increase, the inclusion of the data collection device changed the lung clearance constant by about 27 %. A method for obtaining more accurate parameters of the model still using only the discrete data points is presented by using various reconstruction techniques. Error estimates for these methods should always be obtained because a reconstruction is never better than its associated error. The most promising reconstruction method should be chosen, resulting in a model fit by use of a uniform or integral norm rather than a least-squares fit on a few data points. The question of uncertainty in the data was considered by characterizing the stochastic process as Poisson with variable mean. Diminishing accuracy is to be expected if sampling period is decreased below certain levels but the reconstruction accuracy will increase as the sampling period is decreased. Reconstruction error curves are combined with Poisson probability curves t o obtain a plot of the sampling period as a function of the total maximum percentage error. As expected, in general two different values of the sampling period resulted in the same error because of the uncertainty principle. These curves allow the modeler to determine the best sampling period for a given system and to influence the data collection system if indeed the sampling period can be controlled externally. If it cannot, as is often the case, a quantitative assessment of the relative worth of the model can be made. A fruitful application of this theory would be the analysis of a modern pollution episode for which daily data are available (Glasser, et al., 1967; McCarroll, 1967). In this case the stochastic error for daily data may well be greater than the reconstruction error; it could then be determined if threeday samples would be better for model fitting purposes. Future stochastic models of biological phenomena such as health effects of air pollution will be very sophisticated. 530 Environmental Science & Technology
2
4
i
3 -
2
20 25
0
30
35
40 45 50 55 60 65 % TOTAL E R R O R
70
75
80
Figure 21. Sampling time T as a function of total error for various X values. The unfiltered input pulse is used for the sampling error bounds
Quantitative theory for attacking the model formulation and designing the data recording system must be available. I n fact, the question of whether or not a deterministic model is sufficient is of great importance. Hopefully, this work has set the stage for the development of this theory by analyzing a simple system in detail. Acknowledgment
This work was supported in part by a grant from the Environmental Control Administration, CPEHS, U S . Public Health Service 8 T01-~c-00025-03,and by the Standard Oil of Indiana Foundation. Nomenclature A
cdt) ebl(T) I
ei(t) I
the smallest time interval containing all the sampling times = the stimulus t o the biological system, a function of time = worst-case excitation signal-Le., signal that makes reconstruction errors high = error bounds corresponding to the various reconstruction techniques; functions of the sampling interval = final worst-case errors observed; include both model and reconstruction errors = errors corresponding to the various reconstruction techniques, functions of time = stochastic uncertainty, a function of the sampling interval T = Fourier transform of e2(f) for measurement in the example system =
=
=
1,2,3,4
1,2,3,4
= square root of minus one = parameters used in mathematical
model = coefficients necessary for determining
an interpolating cubic spline function = the output of the mathematical model using a filtered input pulse = pbc(r)after running averaging = the analog (as opposed t o “digital”) output of the mathematical model = p ( t ) after running averaging with time period T = the set of values obtained by sampling pYic(t)at the data times to,. . . , tn = the p ( t ) function after running averaging c,,(t) as model input = Fourier transform of p&) = output of a deterministic model that does not contain a running averager = the fractional amount of energy in the signal pW&) found at radial frequencies below O,, the sampling frequency = approximation to $t) in general and corresponding to reconstruction technique 1 = signal from actual system analogous to the model’s p(t) = ra3(t)in the interval (h, fk) = independent varible discrete points in time when data reports are received = probability of a given derivation from the mean in a Poisson process = data reports = model “data” reports I :
U*
GREEK LETTERS
A
Xmin
= general sampling interval, used in
discussion of sampling error evaluation = mean and variance of a nonstationary Poisson process = minimum value of X expected for a nonstationary Poisson process = band limit of the worst-case signal PWdt)
= band limit of the arbitrary signal f ( t )
= TIT
Literature Cited Ahlberg, J. H., Nilson, E. N., Walsh, J. L., “The Theory of Splines and Their Applications,” Chapter 2, Academic Press, New York, N.Y., 1967. Birkhoff, G., de Boor, C., J . Math. Mech. 13, 827-35 (1964). Blanc-Lapierre, A,, Fortet, R., “Theory of Random Functions,’’ Vol. 1, Gordon and Breach, New York, N.Y., 1964, pp 141-53. British Ministry of Health, Committee of Departmental Offices, Reports on Public Health and Medical Subjects, 95, 1954. Burrows, B., Kellogg, A. L., Buskey, J., Arch. Enuiron. Health 16,406-1 3 (1 968). Cassell, E., Wolter, D. W., Mountain, J. D., Diamond, J. R., Mountain, I. M., McCarroll, J. R., Amer. J. Pub. Health 58(9), 1653-7 (1968). Fleischer, P. E., in “System Analysis by Digital Computer,” Kuo, F. F., Kaiser, J. F., Eds., Wiley, New York, N.Y., 1966, p 180-203. SCI. TECHFriedlander, S. K., Ravimohan. A. L., ENVIRON. NOL. 2, 1101-8 (i968). Glasser. M.. Greenburg. L..’ Field., F.., Arch. Environ. Health 15(6),’ 684-94 (1 967). -’ Goldsmith, J. R., in “Air Pollution,” A. C. Stern, Ed., Academic Press, New York, N.Y., 1968. Hechter, H . H., Goldsmith, J. R., Amer. J. Med. Sci. 241, 65-72 (1961). Hildebrand, F. B., “Introduction to Numerical Analysis,” Chapter 2, McGraw-Hill, New York, N.Y., 1956. IBM Publication H20-0205-3, “System/360 Scientific Subroutine Package Version 111,” 1968, pp 276-9. h e n . Johannes. Ingenito. F. E.. Deane,. M.., Arch. Enuiron. Hehlth 18, 458-61-(1968). Martin, A. E., Proc. Roy. SOC.Med. 57, 969-75 (1964). Martin, A. E., Bradley, W. H., Monthly Bull. Ministry Health Lab. Seru. 19, 56 (1960). McCarroll, J., J. Air. Pollut. Contr. Assoc. 17, 203-9 (1967). Pauoulis. A.. “The Fourier Integral and Its Apulications.” McGraw-Hill, New York, N . Y , 1962, pp 1-16.Papoulis, A., “Probability, Random Variables, and Stochastic Processes,” McGraw-Hill, New York, N.Y., 1965, pp 284-7. Papoulis, A,, Proc. IEEE 54(7), 947-55 (1966). Scott, J. A., Proceedings of the Glasgow Conference, National Smoke Abatement Society, London, England, 1953, pp 2830. Secrest, Don, J . S I A M Nurner., Anal. B 2(3), 440-7 (1965). Shannon, C. E., Proc. IRE 37,lO-21 (1949). Sterling, T . D., Phair, J. J., Pollack, S. V., Schumsky, D . A., De Groot, I,, Arch. Environ. Health 13, 158-74 (1966). Sterling, T. D., Pollack, S. V., Phair, J. J., Arch Enuiron. Health 15, 362-74 (1967). Verma, M. P., Schilling, F. J., Becker, W. H., Arch. Environ. Health 18, 536-43 (1969). Weaver, N. K., J. Occ. Med. 11,455-61 (1969). Wilde, D. J., Beightler, C. S., “Foundations of Optimization,” Prentice-Hall, New York, N.Y., 1967, pp 288-98. Wilkinson, J, H., “The Algebraic Eigenvalue Problem,” Chapter 4, Oxford University Press, London, 1965. A
Received for reciew February 4, 1970. Accepted January 25, 1971.
Volume 5, Number 6 , June 1971 531