THE SELF-JUDGMENT PRINCIPLE IN SCIENTIFIC DATA PROCESSING

P. A. D. de Maine and R. D. Seawright

Ind. Eng. Chem., 1963, 55 (4), pp 29-32

A mathematical device which enables a digital computer to judge compatibility of data with many different theories

A new, impartial digital computer method has been developed which can be used to test compatibility of experimental results with any number of relationships. Statistical bias due to gross errors in the data, round-off error, or spatial distribution of data points is eliminated by incorporating one semiautomatic and three automatic mathematical devices. This general scheme can be used for fitting theory to experimental data, resulting in more effective communication of experimental results.

Problems of communication of results of scientific investigations have long plagued the physical sciences. The immense growth in the volume of literature during the past twenty years emphasizes several inherent deficiencies in the present methods of communication. Often only the conclusions of investigations are published, with consequent loss to other workers of the actual experimental data. Compatibility of data may be checked with only one or two of a large number of possible models and equations. In many cases the experimenter fails to quote precise mathematical and experimental limits of error. A concise, analytically exact method for processing and tabulating data would greatly facilitate the spread of information. Recently we developed a computer method (1) which rapidly checks compatibility of data and theory. Thus the experimenter can test a large number of alternate equations within a reasonable length of time.

Data Processing

Let us suppose that a researcher has collected twenty conjugate pairs of values to test a theory that quantity Y is linear in the quantity X. If the maximum permitted errors in each value of Y and X, set by the accuracy of the observations, are Yer and Xer respectively, the following sequence would be used to test data compatibility:

(1) Use a graphical or statistical method to determine the slope m and intercept c of the best straight line, based on all data points.
(2) For each point calculate the actual error in Y (devY) and in X (devX) from the best straight line.
(3) Discard all data points for which the actual error exceeds the corresponding maximum permitted error.
(4) Use only the accepted sets of conjugate data to repeat steps 1, 2, and 3.

When only a few of the original twenty conjugate pairs of information are discarded, and they occur at random, we say that these data are compatible with the equation Y = mX + c to within the limits of experimental error. On the other hand, if the pattern of discarded data points is not random, or if too many data points are discarded, these data are not compatible with the equation for the preselected limits of error. Obviously this procedure can be applied also to equations containing two or more independent variables.
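The sequence above can be written down compactly. The following is a minimal illustrative sketch in Python (our own notation, not the published programs (1)), assuming Yer and Xer are supplied as single maximum permitted errors and that the slope is nonzero so that the error in X can be measured back through the line:

    # Illustrative sketch of the data-compatibility test described above.
    # points is a list of (X, Y) pairs; Yer and Xer are the maximum
    # permitted errors in Y and X set by the accuracy of the observations.

    def fit_line(points):
        """Least-squares slope m and intercept c of the best straight line."""
        n = len(points)
        sx = sum(x for x, y in points)
        sy = sum(y for x, y in points)
        sxx = sum(x * x for x, y in points)
        sxy = sum(x * y for x, y in points)
        m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        c = (sy - m * sx) / n
        return m, c

    def compatibility_test(points, Yer, Xer, cycles=4):
        accepted = list(points)
        for _ in range(cycles):
            m, c = fit_line(accepted)
            accepted = [(x, y) for x, y in accepted
                        if abs(y - (m * x + c)) <= Yer      # devY within limit
                        and abs(x - (y - c) / m) <= Xer]    # devX within limit
            # a non-random pattern of rejections, or too many of them, means
            # the data are not compatible with Y = m*X + c for these limits
        m, c = fit_line(accepted)
        return m, c, accepted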

Statistical Bias

In any data-compatibility test the method used to fit the curve to the data points will determine the accuracy and validity of the analyses. Graphical methods of curve fitting suffer from several inherent weaknesses. The instrument reliability factors, which define the error limits for each data point, usually are not considered until the researcher is aware of the direct consequences of their application. Limits of reliability and error are often estimated from the final graph itself, and complete impartiality is not always achieved. These factors define the accuracy of the raw data itself, and are complicated functions of the instrument, operator, and so forth (1). Maximum permitted error for each measured quantity is calculated from the predetermined instrument reliability factors. This in turn determines the maximum permitted error in each parameter. Results of an error analysis are seldom reported. Parameters without an error analysis are of little value to the researcher who may wish to examine the mathematical basis of a proposed theory. Figure 1 illustrates the consequences of omitting an error analysis.
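How the instrument reliability factors combine into Yer and Xer depends on the instrument and the experiment. Purely as a placeholder, one might take each maximum permitted error to be a fixed fraction of the measured value; the one per cent figure below echoes the example of Figure 1 and is otherwise arbitrary:

    # Hypothetical illustration only: real reliability factors are, as the
    # text notes, complicated functions of the instrument and operator.
    def max_permitted_errors(Y, X, rel_Y=0.01, rel_X=0.01):
        Yer = rel_Y * abs(Y)   # maximum permitted error in Y
        Xer = rel_X * abs(X)   # maximum permitted error in X
        return Yer, Xer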


Attempts to standardize curve-fitting procedures with statistical techniques based on the method of least squares also suffer from the weaknesses described above. Even after rejecting "outliers," the fit is markedly sensitive to small displacements in individual data points if their values are not evenly spaced, even when these displacements are well within the limits of experimental error. In a least squares analysis performed on a digital computer, three kinds of statistical bias may result in an incorrect analysis. Round-off error occurs when the word size in the distributor of the digital computer is exceeded. Additional error can result from "outliers," or data points in gross error, and from unevenly spaced values for each variable. As the variables may be complicated functions of several observables, this last kind of statistical bias cannot always be avoided by choosing the experimental conditions. The meanings of such terms as statistical bias, instrument reliability factors, and maximum permitted errors, as defined in this paper, have been discussed in detail elsewhere (1). They should not be confused with nonequivalent terms in conventional statistics.
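The first kind of bias is easy to reproduce. The sketch below caricatures a machine whose fixed-point words keep only ten digits after the decimal point; the data and the quantization rule are inventions for illustration, not a model of the original computer:

    # Caricature of round-off error caused by a limited word size.
    def fix10(v):
        """Store a number as a fixed-point word with ten decimal places."""
        return round(v, 10)

    def least_squares(points, word=lambda v: v):
        n = len(points)
        sx = sy = sxx = sxy = 0.0
        for x, y in points:
            sx = word(sx + x)
            sy = word(sy + y)
            sxx = word(sxx + word(x * x))   # products of small values vanish
            sxy = word(sxy + word(x * y))   # when quantized to ten places
        d = word(n * sxx - sx * sx)
        m = (n * sxy - sx * sy) / d
        c = (sy - m * sx) / n
        return m, c

    # Twenty points lying exactly on Y = 1.5 X + 3.0e-8, with tiny ordinates:
    pts = [(i * 1.0e-8, 1.5 * i * 1.0e-8 + 3.0e-8) for i in range(1, 21)]
    print(least_squares(pts))        # full precision recovers about (1.5, 3e-08)
    # least_squares(pts, word=fix10) fails: every x*x (about 1e-16 here) is
    # quantized to zero, so the denominator d collapses.  Normalizing X and Y
    # so that their largest values become 1.0 before the fit, and rescaling m
    # and c afterwards, removes the difficulty.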

Figure 1. Slope and intercept of the line fitting these data must be found. Three points on the left are in gross error. (The table accompanying the figure lists, for each trial, the number of points accepted, the number rejected, and the fitted slope m and intercept c; the successive values recovered were m = 0.664, 0.920, 0.988, 1.010 and c = 3.95, 0.984, 0.290, 0.176.)

15

OPERATOR DISCARD§ THESE

Y 10

POINTS WITH MECHANISM.

5

+

0

I

I

5

10

x

These points dejne t w o intersecting h i s . mechanism is cspaciall) suited to this analysts

Fzgurr 2.

I

15

20

T h e reject-reAtore ~~~

AUTHORS P. A. D. de Maine is a Professor of Chemistry at the University of Mississippi. R. D. Seawright, a student, is an Assistant at the Computer Center. The computer techniques are a direct result of experimental research supported in part by funds from both the Petroleum Research Fund, administered by the American Chemical Society, and the United States Air Force under Grant AF-AFOSR 62-19.

Mechanisms for Reducing Statistical Bias

In our laboratories four mathematical devices have been developed which collectively eliminate statistical bias, defined above, without the use of specialized weighting functions. These four devices are:

Automatic Normalization Mechanism. This is a rescaling feature (illustrated in the table, page 32). The absolute maximum values for Y and X are automatically normalized to 1.0 or 0.1 before application of the reject-restore or automatic transposition mechanisms. After calculation of the slope and intercept, the system of data points and the parameters themselves are denormalized. This device makes format statements in the computer programs unnecessary.

Semiautomatic Reject-Restore Mechanism. After normalization of the coordinates, the computer is instructed to ignore certain data points during the initial curve fitting, and then to review whether or not the ignored points are within the corresponding maximum permitted errors of the median straight line. Data points falling within the error zone (bounded by Y + Yer, Y - Yer, X + Xer, X - Xer) are restored to memory, and the process is repeated. This device is exceptionally useful in analyses in which the data points are described by intersecting curves.

Automatic Transposition Mechanism. After application of the automatic normalization and semiautomatic reject-restore mechanisms, the point of origin of the axes is automatically transposed to the center of gravity of the accepted data points. After the parameters have been calculated, the system is automatically transposed back to the original axes. This mechanism, first introduced by Sillen (2), is particularly useful for eliminating errors in analyses arising from the uneven spacing of data points.

Automatic Data Scanning Procedure. This fully automatic mathematical device for selecting erroneous data points has been discussed elsewhere (1), and no further discussion will be attempted here.
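As a rough sketch of what the transposition step amounts to (using the fit_line routine from the earlier sketch; the variable names are ours, not the program's):

    # Illustrative sketch of the automatic transposition mechanism: move the
    # origin to the center of gravity of the accepted points, fit, move back.
    def fit_about_centroid(points):
        n = len(points)
        xg = sum(x for x, y in points) / n   # center of gravity of the
        yg = sum(y for x, y in points) / n   # accepted data points
        shifted = [(x - xg, y - yg) for x, y in points]
        m, c0 = fit_line(shifted)            # c0 is zero for a least-squares fit
        return m, yg - m * xg + c0           # slope unchanged; intercept moved
                                             # back to the original axes

In exact arithmetic the fitted line is unchanged by the shift; the benefit is numerical, since the centered sums avoid the large cancellations that unevenly spaced points otherwise produce.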

Applications

To illustrate the elimination of round-off error without the use of "format," twenty data points with values of both Y and X in the indicated range were fitted to a straight line by the method of least squares, with the use of programs with and without the automatic normalization mechanism incorporated (see table). It is seen that without automatic normalization, round-off error occurred when the lowest value of the ordinates was below 10^-7. These calculations were made on a computer with a fixed word length of ten digits.

The versatility of a program with the self-judgment principle, automatic normalization, and automatic transposition mechanisms is illustrated in Figure 1. In these data we have allowed a maximum permitted error of one per cent in both X and Y and have introduced three data points (extreme left in Figure 1) in gross error. To the computer we have posed the problem: "Here are twenty data points which we believe fit a straight line to within the limits of experimental error. Some of these data points may be in gross error. Select the valid data points and calculate the correct slope and intercept." The computer was monitored to complete four cycles. The results are shown in Figure 1. By the fourth cycle (J = 4), only the three erroneous data points were discarded and the correct slope m and intercept c were calculated. The reject-restore mechanism was not invoked in these calculations.

To illustrate the usefulness of the reject-restore mechanism for analyses of composite curves, twenty data points defining two intersecting straight lines have been selected (Figure 2). To process these data, the commands for the reject-restore mechanism (1) are selected so that during the first application of the self-judgment principle all data points to the left of the vertical line are ignored. In the normal processing of the data points on the right, the point lying outside the error zone would be discarded. Next the computer is instructed to ignore all data points to the right of the vertical line and the left-hand data points are processed in the normal way. The exact position chosen for the vertical line in Figure 2 is not critical.
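A sketch of the Figure 2 calculation in the same illustrative style (fit_line as before; the selection test and the split abscissa are ours, standing in for the published reject-restore commands (1)):

    # Illustrative sketch of the reject-restore treatment of two intersecting
    # straight lines.  keep_if selects the branch used for the first fit; on
    # each cycle every point inside the error zone of the current line is
    # kept or restored, and every point outside it is discarded.
    def reject_restore_fit(points, Yer, Xer, keep_if, cycles=4):
        kept = [(x, y) for x, y in points if keep_if(x, y)]
        for _ in range(cycles):
            m, c = fit_line(kept)
            kept = [(x, y) for x, y in points
                    if abs(y - (m * x + c)) <= Yer
                    and abs(x - (y - c) / m) <= Xer]
        return fit_line(kept)

    # right-hand line: ignore everything left of the (non-critical) split
    # m1, c1 = reject_restore_fit(pts, Yer, Xer, lambda x, y: x > 10.0)
    # left-hand line: ignore everything to the right instead
    # m2, c2 = reject_restore_fit(pts, Yer, Xer, lambda x, y: x <= 10.0)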

The Self-Judgment Principle

This is a programmed system for data-compatibility tests which enables us to ask a computer such questions as "Do these values for Y and X fit this straight line to within the limits of experimental error?" or "Which of a series of relationships describes these data to within the predetermined instrument reliability factors?" It is obvious that this principle can ultimately be extended so that raw data itself will be automatically processed for compatibility with all known theories before development of a new theory.

Here we shall consider only the simplest form of the self-judgment principle. We shall omit reference to the four devices for reducing statistical bias but shall recognize that they are part of the programmed scheme. With the example of twenty conjugate pairs of values for Y and X, and the corresponding maximum permitted errors (Yer and Xer), the following sequence of events occurs:

(1) The method of least squares is used to fit to the best straight line the data points not rejected by the reject-restore mechanism. The slope m and intercept c are calculated.
(2) The actual error in each value of Y, devY, is calculated with the use of the values for m, c, and X. Next devX is calculated using the values for m, c, and Y.
(3) All data points for which devY is greater than Yer or devX is greater than Xer are discarded.
(4) The slope m and intercept c are calculated for the plot which best describes the remaining data points.
(5) DevY and devX are recalculated for all data points with the new values for m and c.
(6) All data points for which devY is less than Yer and devX is less than Xer are restored.
(7) With all accepted data points, steps 4, 5, and 6 are repeated.
(8) Steps 1 to 7 are repeated J times, where J is the desired number of cycles.
(9) The maximum permitted errors in the slope (Δm) and intercept (Δc) are calculated from the equation:

Yer - m·Xer = Δm·X + Δc

This final step is the error analysis.
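The paper does not show where this relation comes from; one natural reading is as a partition of the permitted error in Y. Displacing the line by Δm and Δc changes the predicted ordinate at X by Δm·X + Δc, while an error of Xer in the abscissa propagates through the slope as m·Xer. For the displaced line still to lie within the error zone of a point on the fitted line, the two contributions together may not exceed Yer:

    m·Xer + Δm·X + Δc = Yer,   that is,   Yer - m·Xer = Δm·X + Δc

with the second-order term Δm·Xer neglected. Writing this condition at two accepted abscissas, for instance the smallest and largest accepted X, then gives two equations from which Δm and Δc can be extracted; the paper does not specify which points are used.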

Table (page 32). Slope m and intercept c computed by least squares for twenty data points in several ranges of ordinates. The program incorporating automatic normalization reproduces the correct values, m = 1.5000 and c = 3.0000, in every range; the conventional least-squares program reproduces them only while the lowest ordinate remains above 10^-7, below which round-off error spoils the result.

LITERATURE CITED

(1) De Maine, P. A. D., Seawright, R. D., "Digital Computer Programs for Physical Chemistry," Macmillan, 1963.
(2) Sillen, L. G., Acta Chem. Scand. 16, 159-72 (1962).