Statistical Estimation of Analytical Data Distributions and Censored Measurements

Kirk K. Nielson* and Vern C. Rogers

Rogers and Associates Engineering Corporation, P.O. Box 330, Salt Lake City, Utah 84110-0330

A numerical method was developed for estimating the shapes of unknown distributions of analytical data and for estimating the expected values of censored data points. The method is based conceptually on the normal probability plot. Data are ordered and then transformed by using a power function to achieve approximate linearity with respect to a computed normal cumulative probability scale. The exponent used in the power transformation is an index of the distribution shape, which covers a continuum on which normality is defined as d = 1 and log normality is defined as d = 0. Expected transformed values of censored points are computed from a straight line fitted to the transformed, accepted data, and these are then back-transformed to the original distribution. The method gives improved characterization of analytical data distributions, particularly in the distribution extremities. It also avoids the biases from improper handling of censored data arising from measurements near the analytical detection limit. Illustrative applications were computed for atmospheric SO2 data and for mineral concentrations in hamburgers.
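The ordering, power transformation, and linearity search described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' procedure: the Box-Cox form of the power function, the Blom plotting positions, the unweighted correlation criterion, and the grid search are all choices made here, and the names `power_transform`, `normal_scores`, and `estimate_index` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def power_transform(x, d):
    """Assumed Box-Cox-style power transform; d near 0 is treated as the log limit."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if abs(d) < 1e-12 else (x**d - 1.0) / d

def normal_scores(n):
    """Normal quantiles at Blom plotting positions (an assumed choice of scale)."""
    p = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
    nd = NormalDist()
    return np.array([nd.inv_cdf(q) for q in p])

def estimate_index(x, grid=None):
    """Return the grid value of d that makes the probability plot most linear."""
    if grid is None:
        grid = np.linspace(-3.0, 5.0, 161)  # hypothetical search range, step 0.05
    xs = np.sort(np.asarray(x, dtype=float))  # order the (positive) data
    z = normal_scores(xs.size)
    # Linearity is scored by the correlation between normal scores and transformed data.
    r = [np.corrcoef(z, power_transform(xs, d))[0, 1] for d in grid]
    return float(grid[int(np.argmax(r))])
```

With this convention, data that are exactly log-normal on the plotting scale return an index near 0 and exactly normal data return an index near 1, matching the continuum described in the abstract.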

INTRODUCTION

In environmental surveillance of chemical compounds, trace elements, and radionuclides, analytical measurements often span very wide ranges, but still may represent a single population. To represent the population by its measured parameters, a central value (mean, median, etc.) and distribution width (standard deviation, range, etc.) generally are sought for simplicity and convenience in statistical analyses. Two common causes of subjective bias in representing the population are the choice of distribution attributed to it and the method of dealing with censored data points. The distribution often is simply assumed to be normal or log-normal, even though it may have an intermediate or alternative shape. Even when distributions are analyzed by using normal or log-normal probability plots, their shapes still are usually assessed subjectively for linearity and for approximation to normality or log normality. While the biases resulting from inaccurate distribution assumptions affect both central value and width estimates, they can be particularly misleading when one is making decisions about confidence intervals or compliance with prescribed limits.

The problem of censored data points also is common in environmental surveillance, where vanishingly small concentrations frequently lead to observations that are less than the analytical limit of detection (LD). As concentrations approach zero, the combined measurement uncertainties in samples and blanks can even cause a fraction of observations to be negative. Although there are numerical definitions of the appropriate LD for data acceptability (1-4), there is less consistency in reporting and interpretation of the lower, censored measurements. They variously are ignored or reported as zero, as LD ...

Table II

                                      power-transformed fit params                                log-normal fit params
element  LD^a   N^b   mean^c ± SD     distribn index^d  rel fit^e uncert, %  median^c  slope^c    median^c  slope^c
Al       110    92    163 ± 37        0.6 ± 0.7         1.7                  147       44         146       41
Si       50     36    79 ± 39         -0.1 ± 0.5        8.9                  37        21         35        23
P        40     112   1920 ± 350      0.6 ± 0.5         1.9                  1900      350        1880      350
S        30     112   2100 ± 280      1.8 ± 0.6         2.9                  2120      270        2080      290
Cl       100    112   11800 ± 1800    -2.3 ± 0.7        1.9                  11400     1400       11700     1600
K        60     112   3190 ± 780      -1.6 ± 0.5        1.9                  3030      540        3120      620
Ca       30     112   1490 ± 450      1.2 ± 0.3         12.3                 1500      430        1420      450
Mn       2      109   4.3 ± 1.2       0.1 ± 0.4         3.0                  4.1       1.1        4.1       1.1
Fe       2      112   45 ± 8          0.2 ± 0.5         1.3                  45                   44        8
Cu       0.8    46    1.3 ± 0.5       4.1 ± 0.8         5.3                  0.7       0.4        0.7       0.4
Zn       0.8    112   34 ± 9          0.2 ± 0.3         2.5                  32        9          32        9
Br       0.6    112   30 ± 11         1.1 ± 0.2         18.9                 30        11         28        12
Rb       0.6    112   3.7 ± 2.3       -0.5 ± 0.2        7.4                  3.1       1.3        3.3       1.4
Sr       1      112   5.4 ± 2.3       -0.9 ± 0.3        14.1                 4.8       1.6        5.0       1.8

a 2σ analytical limit of detection based on X-ray peak counting statistics. b Number of measured values above the LD. c Concentrations and standard deviations in ppm dry weight. d Computed from eq 4-6 and 8. e Computed as $\left(\sum[(Y - Y_{\mathrm{fit}})/Y]^2/(n - 1)\right)^{0.5}$.
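The relative fit uncertainty of footnote e can be illustrated with a short function. The sketch assumes the footnote reads as the square root of the summed squared relative residuals, (Y - Y_fit)/Y, divided by n - 1, and that the result is scaled by 100 to give the percent values of the table column; both points are interpretations of the printed expression, and the name `relative_fit_uncertainty` is hypothetical.

```python
import math

def relative_fit_uncertainty(y, y_fit):
    """Percent relative fit uncertainty: 100 * sqrt(sum(((y - y_fit)/y)**2) / (n - 1)).

    Assumes every observed value y is nonzero and that at least two points are given.
    """
    if len(y) != len(y_fit) or len(y) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    total = sum(((yi - fi) / yi) ** 2 for yi, fi in zip(y, y_fit))
    return 100.0 * math.sqrt(total / (len(y) - 1))
```

For two observations of 100 and 200 fitted as 110 and 180, each relative residual has magnitude 0.1, so the function returns 100 × sqrt(0.02), about 14.1%.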

log-normal distribution and to a normal distribution that was transformed by using the distribution index d = 0.301 in eq 4. The distribution index was determined iteratively by using eq 4-6, with the same point weighting indicated in Figure 7. The computed distribution index indicates that the data actually have a distribution shape intermediate between log-normal (d = 0) and normal (d = 1). Although the offset used in the three-parameter log-normal analysis (16) facilitates a quasi-log-normal fit to the data with similar medians (0.032 for the three-parameter log-normal fit versus 0.031 for the power-transformed normal fit), the resulting fit, shown in Figure 7, is considerably different in the extremities of the distribution. At the low end, the negative values suggested by the three-parameter log-normal fit are reduced by the power-transformed fit, and at the high end, the maximum values are predicted to be lower by the power-transformed fit. If data as in Figure 7 were used to estimate compliance with an SO2 standard of 0.4 ppm, the three-parameter log-normal distribution would indicate noncompliance about 3 times as often as the power-transformed normal distribution.

In an application to the analysis of mineral concentrations in fast foods, the method was used to estimate distribution shapes and censored data ranges. X-ray fluorescence analyses of 14 minerals in 112 commercially obtained hamburger samples yielded the means and standard deviations presented in Table II for all measurements above the LD. The distribution indices given in Table II were computed for each element by subjecting these data to the distribution analyses defined by eq 4-6. Censored values of Al, Si, Mn, and Cu were computed from fits to the power-transformed data, and the resulting fitted values were transformed back to the original data distributions. The computed minimum values in these distributions were 49, 10, 1.9, and 0.2 ppm for Al, Si, Mn, and Cu, respectively.
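The censored-value step and the compliance comparison described above can be sketched as follows. The code is a minimal illustration rather than the paper's eq 4-6: it assumes a Box-Cox form for the power transform, Blom plotting positions, an unweighted least-squares line (the paper applies point weighting), and that all censored points occupy the lowest ranks. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def forward(x, d):
    """Assumed Box-Cox-style power transform; d = 0 is the log limit."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if d == 0 else (x**d - 1.0) / d

def inverse(t, d):
    """Back-transform to original units (NaN where d*t + 1 < 0 for fractional 1/d)."""
    t = np.asarray(t, dtype=float)
    return np.exp(t) if d == 0 else (d * t + 1.0) ** (1.0 / d)

def fit_accepted(accepted, n_censored, d):
    """Fit a line to the transformed accepted data against normal scores."""
    x = np.sort(np.asarray(accepted, dtype=float))
    n = x.size + n_censored                          # total sample size
    p = (np.arange(1, n + 1) - 0.375) / (n + 0.25)   # Blom positions (assumed)
    z = np.array([NormalDist().inv_cdf(q) for q in p])
    # Accepted points take the upper ranks; censored points take the lowest ones.
    slope, intercept = np.polyfit(z[n_censored:], forward(x, d), 1)
    return intercept, slope, z

def fill_censored(accepted, n_censored, d):
    """Expected values of the censored points, back-transformed to original units."""
    intercept, slope, z = fit_accepted(accepted, n_censored, d)
    return inverse(intercept + slope * z[:n_censored], d)

def exceedance_probability(limit, intercept, slope, d):
    """P(X > limit) under the fitted power-transformed normal model."""
    z_limit = (float(forward(limit, d)) - intercept) / slope
    return 1.0 - NormalDist().cdf(z_limit)
```

In this parameterization the intercept is the transformed median and the slope is the transformed standard deviation, so comparing `exceedance_probability` at a regulatory limit for two candidate fits gives the kind of noncompliance ratio quoted for the SO2 example. The slopes in Table II are reported in ppm, so they are not directly interchangeable with the transformed-scale slope used here.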
The wide variation in distribution indices for different elements in Table II results from the different modes of occurrence of the minerals and the different populations to which they belong. Bromine and calcium are approximately normally distributed (d = 1), whereas Si, Mn, Fe, Cu, and Zn are nearly log-normally distributed (d = 0). Distributions of Cl, K, Rb, and Sr are skewed positively even more than a log-normal distribution, and S is skewed negatively from a normal distribution. The distributions of Al and P are intermediate between normal and log-normal distributions. For comparison with the arithmetic means and standard deviations, the fitted intercepts (medians) and slopes (standard deviations) of the data also are presented in Table II. For

