Anal. Chem. 1996, 68, 2845-2849

Statistics for Critical Clinical Decision Making Based on Readings of Pairs of Implanted Sensors

David W. Schmidtke, Michael V. Pishko,† Christopher P. Quinn, and Adam Heller*

Departments of Chemical Engineering, The University of Texas at Austin, Austin, Texas 78712-1062

Low error rates are essential if lives of patients are to depend on readings of implanted sensors, such as glucose sensors in insulin-dependent diabetic patients. To verify the operation and to calibrate on demand an implanted sensor, it is necessary that calibration through a single, independent measurement involving withdrawal of only one sample of blood and its independent analysis be feasible. Such a one-point calibration must be accurate. Borrowing from nuclear reactor safety assurance, where a likelihood ratio test is applied to readings of pairs of pressure sensors for shutdown/no shutdown decisions, we apply a similar test to sensor pairs implanted in rats. We show, for five sets of glucose sensor pairs, calibrated in vivo by withdrawal of a single sample of blood, that application of the likelihood ratio test increases the fraction of the clinically correct readings from 92.4% for their averaged readings to 98.8%.

We consider the statistics for treating readings of pairs of glucose sensors implanted in a diabetic patient. The acceptance of false readings can be life-threatening, for example, if the blood glucose level is, in truth, low, but the readings are falsely high and the patient is injected with insulin.

Preimplantation calibration of a sensor in a physiological buffer solution to which glucose is added may not be valid after implantation. Both the intercept, meaning the signal at a vanishingly small glucose concentration, and the sensitivity, meaning the variation of signal with glucose concentration, can change. Recently, we have shown that it is possible to build glucose-specific amperometric glucose sensors with intercepts close to nil, whose signals are not affected by the presence of the ensemble of electroreducible and electrooxidizable molecules and ions in the subcutaneous fluid.1 Such sensors can be calibrated in vivo by a one-point in vivo calibration. Independent analysis of a single withdrawn blood sample provides one calibration point.
The second calibration point is, by definition, the origin. When the measured glucose electrooxidation current varies linearly with the blood glucose level, the calibration curve is then the line connecting the calibration point with the origin. Because a patient’s life may depend on the validity of such a calibration, it is essential that more than one sensor be used to increase the accuracy of the measurement. Furthermore, a statistical test must be applied for acceptance or rejection of the readings of the redundant sensors, particularly the acceptance or rejection of those readings used for one-point in vivo calibration.

† Current address: Massachusetts Institute of Technology, 45 Carlton St., Cambridge, MA 02130.
(1) Csöregi, E.; Quinn, C. P.; Schmidtke, D. W.; Lindquist, S. E.; Pishko, M. V.; Ye, L.; Katakis, I.; Hubbell, J. A.; Heller, A. Anal. Chem. 1994, 66, 3131-3138.

© 1996 American Chemical Society

The test chosen must reject even those sets of redundant readings having a small likelihood of being false and leading to inappropriate treatment, such as insulin injection when glucose ingestion is warranted, or glucose ingestion when insulin injection is warranted.

When the readings of pairs of subcutaneously implanted glucose sensors are compared, then, in actuality, a hypothesis, which may or may not be valid, is tested. According to the null hypothesis, the readings of the two sensors should be equal; that is, there is no statistically significant difference between the two measurements. In a statistical test for the validity of the null hypothesis, two types of errors can be made. The first, type I error, is rejection of the null hypothesis when, in fact, it is valid. This is often called a false positive. It leads to the conclusion that the sensor measurements differ when, in truth, they are statistically the same. The probability of occurrence of a type I error is termed α. The second possible error type, termed type II error, is acceptance of the null hypothesis as valid when, in fact, it is false. It is assumed, for example, that the readings of two sensors in their respective environments are statistically the same when, in fact, they are statistically different, for reasons such as transient blockage of glucose transport to one of the sensors or progressive corrosion or change in resistance of one of its electrical contacts. This error is known as a false negative, and its probability is termed β.

What are the implications of type I and type II errors in the treatment of diabetic patients? A type I error is rather benign. It merely leads to discarding of readings of the sensor pair, when, in fact, the readings were valid and should have been accepted. In contrast, type II errors can have severe consequences.
In the most extreme cases, they can lead to treating with insulin a hypoglycemic patient or feeding glucose to a hyperglycemic patient. The uniqueness of the probability density ratio test that we apply is that (a) it takes into account the likelihood of both type I and type II errors when making a decision to “accept” or “reject” and (b) it skews the likelihood in favor of making a type I error while reducing the likelihood of making a type II error. The test is used by the nuclear power industry because it is skewed on the side of safety, sacrificing valid readings so that no alarm-warranting readings will be missed. The likelihood ratio test can be contrasted with tests where reject decisions are based on either the absolute or percent difference between readings of two sensors and where the difference in the medical implications of type I and type II errors is disregarded.

Analytical Chemistry, Vol. 68, No. 17, September 1, 1996 2845

EXPERIMENTAL SECTION

Reagents. D-Glucose, ascorbic acid, and L-(+)-lactic acid were supplied by Sigma Chemical Co. (St. Louis, MO). All chemicals were used as received. The ascorbic and lactic acids were dissolved in 10 mM phosphate-buffered saline (PBS) (0.15 M NaCl, pH 7.4) and prepared prior to use. Glucose stock solutions (2 M) were prepared in either water or PBS and were allowed to mutarotate overnight at room temperature and subsequently stored at 4 °C.

Electrode Preparation. The sensor was fabricated as earlier described.1 It had a recessed gold wire tip with four layers: a sensing layer, a mass-transport-controlling and electrically insulating barrier layer, an interference-eliminating layer that enzymatically oxidized electrooxidizable interferants before they reached the sensing layer, and a biocompatible layer that prevented fouling of the sensor. The sensing layer was made by cross-linking the redox polymer PVI-Os (ref 4) and a recombinant glucose oxidase (rGOX, 35% purity, Chiron Corp., Emeryville, CA) with poly(ethylene glycol) diglycidyl ether 400 (PEGDGE, Polysciences, Warrington, PA). The barrier layer between the sensing and the interference-eliminating layers was formed of polyallylamine (PAL, Polysciences), cross-linked with a polyfunctional aziridine (PAZ, XAMA-7, Virginia Chemicals, Portsmouth, VA). The interference-eliminating layer5,6 was made by co-immobilizing horseradish peroxidase (HRP, Boehringer-Mannheim, Indianapolis, IN) with recombinant lactate oxidase (rLOX, Genzyme, Cambridge, MA) by first oxidizing the HRP with sodium periodate (Sigma) and then forming Schiff bases between the polyaldehyde formed and peripheral amine functions of the enzymes. The outer biocompatible hydrogel membrane7 was made by photopolymerizing a 10% tetraacrylated poly(ethylene glycol) solution in the presence of the photoinitiator 2,2-dimethoxy-2-phenylacetophenone by exposure to ultraviolet light.

In Vitro Experiments.
Prior to implantation, the sensitivities of the sensors were determined, and their insensitivity to electrooxidizable interferants was confirmed in a three-electrode cell that contained pH 7.4 PBS and had a saturated calomel reference electrode (SCE), a platinum counter electrode, and a glucose sensor as working electrode. The sensing electrode was poised at 300 mV vs SCE, and the cell was maintained at 37 ± 0.5 °C. The current outputs of the sensors used were independent of the presence of interferants. The average apparent Michaelis-Menten constant, Km, of the 10 sensors, determined from their Eadie-Hofstee plots prior to implantation, was 16.6 ± 8.3 mM, and their average sensitivity at 10 mM glucose was 1.5 ± 0.5 nA/mM. The average loss of sensitivity in the explanted (vs the implanted) sensors was 20% ± 2%. There was no correlation between the loss in sensitivity and the duration of the implantation.

(2) Gross, K. C.; Humenik, K. E. Nucl. Technol. 1991, 93, 131-137.
(3) Wald, A.; Wolfowitz, J. Ann. Math. Stat. 1948, 19, 326-339.
(4) Ohara, T.; Rajagopalan, R.; Heller, A. Anal. Chem. 1993, 65, 3512-3517.
(5) Maidan, R.; Heller, A. Anal. Chem. 1992, 64, 2889-2896.
(6) Maidan, R.; Heller, A. J. Am. Chem. Soc. 1991, 113, 9003-9004.
(7) Quinn, C. P.; Pathak, C. P.; Heller, A.; Hubbell, J. A. Biomaterials 1995, 16, 389-396.

In Vivo Experiments. In vivo experiments, 6-10 h long, were carried out in 300 g male Sprague-Dawley rats. The rats were fasted overnight and were anesthetized with an intraperitoneal (ip) injection of sodium pentobarbital (65 mg/kg). An ip injection of atropine sulfate dissolved in PBS (166 mg/kg) was then administered to suppress respiratory depression. Once the rat was anesthetized, a portion of the rat’s abdomen was shaved and coated with an electrode gel, and an Ag/AgCl reference electrode (Microelectrodes Inc., Londonderry, NH) was placed on the shaved surface of its skin. Two wire sensors were then implanted subcutaneously, using a 22 gauge Per-Q-Cath introducer (Gesco International, San Antonio, TX). One sensor was placed near the right scapula, the other near the left scapula, and the exterior parts of the wires were then taped to the skin to avoid dislodgement. The wires, along with the reference electrode, were connected to a Model 400 PAR bipotentiostat. The potential at which the sensors were poised was 0.3 V vs the Ag/AgCl electrode. The currents were recorded using a data logger and were transferred at the end of the experiment to a computer. The sensors were allowed to reach a basal signal level for at least 1 h before blood sampling was started. Blood samples were obtained from the rat’s tail, and all blood samples were analyzed using a YSI Model 23A glucose analyzer. Approximately 30 min after the start of blood sampling, the ip glucose infusion was started, using a syringe pump (Harvard Apparatus), at a rate of 120 mg of glucose min⁻¹ kg⁻¹. The glucose infusion was maintained for ∼1 h. In four of the experiments, 1-2 units of bovine pancreas insulin dissolved in PBS was injected intraperitoneally after the blood glucose concentration had reached a maximum. At the end of the experiment, the rat was euthanized by sodium pentobarbital injection ip or asphyxiation by CO₂, consistent with the recommendations of the panel on Euthanasia of the American Veterinary Association. The in vivo experiments were approved by the University of Texas Institutional Animal Use and Care Committee.

Statistical Analysis. The means ± SEM of all data are given where appropriate.
One-point calibrations were performed by assuming a valid zero point (meaning zero current at zero glucose concentration) and calculating a sensitivity coefficient from the YSI blood glucose measurement and the current of the sensor at the time of withdrawal of the blood sample. This sensitivity coefficient was then used to convert the sensor’s current readings into estimated subcutaneous glucose concentrations.

RESULTS AND DISCUSSION

A probability ratio test was used to determine whether the difference, y = |g1 - g2|, between the two sensors’ glucose readings (g1 and g2) at time t was statistically significant. Readings g1 and g2 were based on one-point calibrations of the two sensors. The algorithm, used to reject risky measurement points, tests for the validity of two hypotheses. Test hypothesis 1, H1, examines the set for a statistically significant difference in readings of sensor pairs, greater than |M|, where M is the system disturbance magnitude, meaning the difference in glucose readings above which (+M) or below which (-M) the change is considered a clinically significant disturbance. Test hypothesis 2, H2, examines the set for the probability of statistically no difference between the signals of the two sensors.
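The one-point calibration and the paired difference y described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the sensor currents, and the reference blood glucose value are all hypothetical and serve only to show the arithmetic of a calibration line through the origin.

```python
def one_point_sensitivity(current_nA: float, blood_glucose_mg_dl: float) -> float:
    """Sensitivity (nA per mg/dL) from a single blood sample, assuming zero
    current at zero glucose (calibration line through the origin)."""
    if blood_glucose_mg_dl <= 0:
        raise ValueError("reference glucose must be positive")
    return current_nA / blood_glucose_mg_dl


def estimate_glucose(current_nA: float, sensitivity: float) -> float:
    """Convert a sensor current into an estimated glucose concentration."""
    return current_nA / sensitivity


# Hypothetical numbers: one blood sample (90 mg/dL) calibrates both sensors.
s1 = one_point_sensitivity(current_nA=9.0, blood_glucose_mg_dl=90.0)
s2 = one_point_sensitivity(current_nA=13.5, blood_glucose_mg_dl=90.0)

# Later currents from the same two sensors, converted to glucose estimates.
g1 = estimate_glucose(18.0, s1)
g2 = estimate_glucose(28.35, s2)

# Paired difference that the ratio test examines.
y = abs(g1 - g2)
```

Because each sensor gets its own sensitivity coefficient, two sensors with different raw currents can still report comparable glucose estimates; y then measures how far the two calibrated readings have drifted apart.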

H1: y is drawn from a Gaussian probability density function (pdf) with mean M and variance σ²
H2: y is drawn from a Gaussian pdf with mean 0 and variance σ²

The test is then performed by examining the likelihood of a significant difference of magnitude M in sensor readings compared to the likelihood of no difference. The likelihood ratio, R, is defined as

$$R = \frac{f(y \mid H_1)}{f(y \mid H_2)} \qquad (1)$$

where f(y|H1) is the probability density of y if H1 is true and f(y|H2) is the probability density of y if H2 is true. We then define error probabilities for deciding whether H1 or H2 is true:

H1 is decided with probability 1 - β
H2 is decided with probability 1 - α

where, as stated in the introduction, α is the probability of accepting H1 when H2 is true. This is the probability of a false alarm, meaning the probability of a valid glucose measurement being rejected when it should have been accepted. Similarly, β is the probability of accepting H2 when H1 is true. This is the probability of a missed alarm, meaning the probability of a glucose measurement being accepted when it should have been rejected. For any particular value of R, there are three possibilities: (1) that hypothesis 1 is true, (2) that hypothesis 2 is true, and (3) that statistically neither hypothesis 1 nor hypothesis 2 is true. Here, we are only concerned with whether or not it is true that the difference between the readings of the two sensors warrants their rejection; this is the case when hypothesis 1 is true. The thresholds for accepting a particular hypothesis are then related to the error probabilities by the following expressions:

$$\text{accept } H_1 \text{ if } R \geq \frac{1-\beta}{\alpha} \qquad (2)$$

$$\text{accept } H_2 \text{ if } R \leq \frac{\beta}{1-\alpha} \qquad (3)$$
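As a minimal sketch of eqs 2 and 3, the two decision thresholds follow directly from the error probabilities. The function names and the "undecided" label for the region between the thresholds are illustrative assumptions, not part of the published algorithm; the example call uses α = 0.5 and β = 0.05, the values adopted in this work.

```python
def decision_thresholds(alpha: float, beta: float) -> tuple[float, float]:
    """Return (upper, lower): accept H1 if R >= upper (eq 2),
    accept H2 if R <= lower (eq 3)."""
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    upper = (1.0 - beta) / alpha
    lower = beta / (1.0 - alpha)
    return upper, lower


def decide(ratio: float, alpha: float, beta: float) -> str:
    """Classify a likelihood ratio R against the thresholds of eqs 2 and 3."""
    upper, lower = decision_thresholds(alpha, beta)
    if ratio >= upper:
        return "accept H1"  # readings differ significantly: reject the pair
    if ratio <= lower:
        return "accept H2"  # no significant difference between the sensors
    return "undecided"      # neither threshold reached


# alpha = 0.5 and beta = 0.05 give thresholds of about 1.9 and 0.1.
upper, lower = decision_thresholds(alpha=0.5, beta=0.05)
```

A large α pulls the upper threshold down, so more pairs are flagged as different (more false alarms), while a small β pushes the lower threshold toward zero, so fewer pairs are positively accepted as equivalent.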

We set the value of α, the probability of a valid glucose reading being rejected when it should have been accepted, at a relatively high value, 0.5, because there is no particular clinical penalty associated with rejecting a correct reading. At the same time, we set the value of β, the probability of accepting a glucose reading when it should have been rejected, at a low value, 0.05, because accepting a reading that should have been rejected may endanger the life of the diabetic patient, the potential user of the algorithm. Assuming that y is normally distributed, the likelihood that H1 is true is given by

$$L(y \mid H_1) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\left[\frac{-1}{2\sigma^2}\left(y^2 - 2yM + M^2\right)\right] \qquad (4)$$

Similarly, the likelihood that H2 is true is given by

$$L(y \mid H_2) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\left[\frac{-1}{2\sigma^2}\left(y^2\right)\right] \qquad (5)$$

The likelihood ratio, R, is then given by the ratio of eqs 4 and 5:

$$R = \exp\left[\frac{-1}{2\sigma^2}\bigl(M(M - 2y)\bigr)\right] \qquad (6)$$
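Eq 6 and the resulting accept/reject decision can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names and the example readings are hypothetical, while the defaults (M equal to 20% of the mean of the two readings, σ² = 81 mg²/dL², and rejection when R exceeds 1.9) are the values this paper assigns in its analysis of the rat data.

```python
import math


def likelihood_ratio(y: float, M: float, sigma2: float) -> float:
    """Eq 6: R = exp[-(M (M - 2 y)) / (2 sigma^2)]."""
    return math.exp(-(M * (M - 2.0 * y)) / (2.0 * sigma2))


def gaussian_pdf(y: float, mean: float, sigma2: float) -> float:
    """Gaussian density of eqs 4 and 5 (mean M for H1, mean 0 for H2)."""
    norm = math.sqrt(2.0 * math.pi * sigma2)
    return math.exp(-((y - mean) ** 2) / (2.0 * sigma2)) / norm


def reject_pair(g1: float, g2: float, sigma2: float = 81.0,
                m_fraction: float = 0.20, threshold: float = 1.9) -> bool:
    """True if a pair of glucose readings (mg/dL) fails the ratio test."""
    y = abs(g1 - g2)                     # paired difference
    M = m_fraction * (g1 + g2) / 2.0     # disturbance magnitude, 20% of mean
    return likelihood_ratio(y, M, sigma2) > threshold


# Hypothetical readings in mg/dL.
close_pair = reject_pair(100.0, 104.0)   # small difference between sensors
far_pair = reject_pair(100.0, 130.0)     # large difference between sensors
```

Solving R > T for y shows that the test is equivalent to a threshold on the difference itself, y > M/2 + (σ²/M) ln T. With M = 20 mg/dL, σ² = 81 mg²/dL², and T = 1.9, pairs differing by more than roughly 12.6 mg/dL are rejected, and the cutoff grows with M, that is, with the glucose level.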

Figure 1. Variation of the output currents of two subcutaneously implanted sensors during and after intraperitoneal infusion of glucose in a rat. Line A represents the uncorrected signal of sensor 1, while line B represents the normalized signal of sensor 2. The symbols (●) represent glucose concentrations in blood samples from the tail vein of the rat, measured with a YSI glucose analyzer.

Figure 2. Clarke-type clinical error grid for the experiment of Figure 1, using each of the 33 withdrawn blood samples for one-point calibration of either of the two sensors. The distribution of the possible 2178 readings is shown.

To test the usefulness of the proposed algorithm, measurements were performed using five pairs of the previously described microwire sensors1 implanted subcutaneously in rats. Figure 1 shows the results of one of the experiments, where the independent blood glucose analyses of periodically withdrawn blood samples were assumed to be true and the validity of the continuous subcutaneous readings was tested. Figure 2 evaluates the clinical accuracy of the measurements of Figure 1 using the error grid analysis of Clarke et al.9 Clarke et al. divide a grid, where the x-axis is the true glucose concentration and the y-axis is the estimate from a reading of a sensor, into five zones. This division defines for the clinician not only the magnitude of the error but also its clinical implication. Readings in zone A are accurate; those in zone B, clinically acceptable; those in zones C, D, and E lead to increasingly inappropriate and then dangerous treatments.

(8) Consensus Development Panel. Consensus statement on self-monitoring of blood glucose. Diabetes Care 1987, 10, 95-99.
(9) Clarke, W. L.; Cox, D.; Gonder-Frederick, L. A.; Carter, W.; Pohl, S. L. Diabetes Care 1987, 5, 622-627.

Table 1. Distribution of One-Point Calibration-Based Glucose Readings in a Clarke-Type Grid for Individual Implanted Sensors (% of readings in each zone)

sensor      A       B       C       D       E
1          54.5    32.2     0.0    13.3     0.0
2          53.5    33.9     8.8     3.1     0.7
3          83.6    13.3     0.0     3.2     0.0
4          54.0    37.6     0.0     8.4     0.0
5          94.0     5.7     0.0     0.3     0.0
6          77.9    19.4     2.2     0.6     0.0
7          69.4    21.7     0.0     8.9     0.0
8          62.2    27.6     1.3     8.7     0.2
9          86.7    12.8     0.0     0.5     0.0
10         52.6    24.5     1.5    21.4     0.0
av         68.8    22.9     1.4     6.9     0.1

For the experiment of Figure 1, there were 33 × 33 = 1089 points for each sensor, or 2178 points for the two sensors. As can be seen in Figure 2, 86% of the points fell in zone A, 12.5% in zone B, 1.1% in zone C, 0.4% in zone D, and 0% in zone E. Thus, 98.5% of the points fell in the clinically acceptable zones A and B. The error grid analyses for all five implanted sensor pairs are shown in Table 1: 91.7% of the points fell in zones A and B (A, 68.8%; B, 22.9%), and 8.4% fell in zones C, D, and E. When the current of sensor 1 and the normalized current of sensor 2 were averaged and used for one-point calibration, the fraction of points in zone A increased only from 68.8% to 70.7%, and the fraction in the clinically acceptable zones A and B only from 91.7% to 92.4%.

The accuracy of the sensor pairs was improved significantly upon using the probability ratio test, accepting only those reading pairs for which R was less than 1.9 (the value calculated for α = 0.5 and β = 0.05 in eq 2). The value of R (eq 6) was calculated as follows. In the first step, a likelihood ratio was calculated by preassigning the values M = 20% (of the average of the two implanted sensors’ readings) and σ² = 81 mg²/dL². We set the value of M at 20% because most home glucose monitoring systems are only accurate to within 15%-20%. The basis for making M variable and increasing with the sensor readings was that, in the practice of diabetes management, the impact of the absolute error in the glucose estimate is greater at low glucose concentrations than at high ones, the absolute accuracy needed depending on the blood glucose concentration. Next, the current readings were converted into estimated glucose concentrations by a one-point calibration, based on the independent analyses of the very first blood sample withdrawn from each of the five rats.
The differences, y, between the glucose readings of the pairs were then calculated. These values, along with the values of M and σ², were substituted into eq 6, and probability ratios for each independent subsequent blood sampling point were calculated. All points with probability ratios greater than 1.9 were rejected. One-point calibrations were then performed using the averaged currents of the two sensors, the calibration points now limited to those passing the probability ratio test. Figure 3A shows, for the experiment of Figure 1, the glucose estimates of sensor 1 (○) and sensor 2 (●), both calibrated only once using the very first blood sample withdrawn. The corresponding values of R, calculated by using eq 6, are plotted in Figure 3B. Figure 4 shows the Clarke grid analysis of those points passing the probability ratio test. In this particular experiment, all points were in the clinically acceptable region, the fraction of readings in the clinically correct Clarke zones A and B increasing from 98.5% to 100%.

Figure 3. (A) Plot of the subcutaneous glucose concentrations, estimated through the normalized readings of the two sensors: (○) sensor 1 and (●) sensor 2. (B) Time dependence of the distribution of the points passing and failing the likelihood ratio threshold test.

Figure 4. Clarke-type error grid for the experiment of Figure 1, where only paired readings that survived the probability ratio test are used. The distribution of the possible 324 readings is shown.

Since the output of the two subcutaneous sensors is read continuously, and because the shortest time interval between the readings that is relevant in diabetes management is >10 min, it is statistically unlikely that readings of two properly functioning sensors will differ for longer than the 10 min clinically relevant period. If the difference is, however, prolonged, it should and will be interpreted as a malfunctioning sensor. In this case, an alarm would be triggered in a practical system. In performing a one-point calibration, detection of a significant difference will tell the patient to discard the calibration point, draw another blood sample, and recalibrate. Table 2 shows the results of the error grid analysis for the five pairs of sensors in the five rats when the probability ratio test is applied. The percentage of points falling in zones A and B increased from 92.4% to 98.8%. A one-tailed paired Student’s t-test showed that this increase in accuracy was statistically significant (p = 0.025).10

Table 2. Distribution of One-Point Calibration-Based Glucose Readings in a Clarke-Type Grid for Pairs of Implanted Sensor Readings, Rejecting Points Failing the Probability Ratio Test (% of readings in each zone)

expt        A       B       C       D       E
1         100.0     0.0     0.0     0.0     0.0
2          95.7     4.3     0.0     0.0     0.0
3          92.6     7.4     0.0     0.0     0.0
4          74.2    20.0     0.0     5.8     0.0
5         100.0     0.0     0.0     0.0     0.0
av         92.5     6.3     0.0     1.2     0.0

(10) The one-tailed test was used rather than a two-tailed test because it was expected that the error rate could only decrease by using the probability ratio test.

CONCLUSIONS

We show that the clinical accuracy of in vivo glucose readings is significantly improved by using sensor pairs and applying a likelihood ratio test.2,3 The test identifies and rejects those readings of pairs of implanted sensors that are unsuitable for their one-point calibration in vivo. When only those readings that pass the test are used for calibration, the likelihood of a clinically significant error is substantially reduced.

ACKNOWLEDGMENT

The National Institutes of Health supported this work through Grant DK42015. The authors thank Dr. Timothy Ohara and Dr. Ravi Rajagopalan for the PVI-Os polymer and Dr. Elisabeth Csöregi for providing data for construction of the electrodes.

Received for review February 29, 1996. Accepted May 31, 1996. AC9602027

Abstract published in Advance ACS Abstracts, July 15, 1996.
