Establishing Clinical Detection Limits of Laboratory Tests - ACS

Dec 9, 1987 - Clinical Pathology Department, Clinical Center, National Institutes of Health, Bethesda, MD 20892. Detection in Analytical Chemistry. Ch...
Chapter 8


Establishing Clinical Detection Limits of Laboratory Tests Mark H. Zweig Clinical Pathology Department, Clinical Center, National Institutes of Health, Bethesda, MD 20892

Fundamental clinical laboratory test performance can be described in terms of accuracy, or the ability to correctly classify subjects into clinically relevant subgroups. Receiver operating characteristic (ROC) curves demonstrate the limits of a given test to detect the alternative states of interest over the complete spectrum of operating conditions, providing a comprehensive and pure index of accuracy. Obtaining valid data for ROC analysis requires attention to the following important steps: (1) define carefully the specific clinical question to be addressed; (2) choose subjects who are representative of the population to which the test is ultimately to be applied; (3) perform all tests being evaluated on all subjects; (4) determine the "true" diagnosis by rigorous and complete means independent of the test(s) being studied; and (5) evaluate and compare test performance at all decision levels using ROC curves.

Swets and Pickett (1) divide test performance into a discrimination or accuracy aspect and a decision or efficacy aspect. Accuracy, on the one hand, refers to the ability of the test to classify, to correctly discriminate between alternative clinical states of the subjects under study (i.e., signals vs. noise, disease vs. non-disease, chest pain with myocardial infarction vs. chest pain without infarction, blood in stools due to malignancy vs. blood in stools from other conditions). This is accuracy or correctness relative to truth, as best as we can determine that truth. We can express accuracy as clinical sensitivity and specificity. Efficacy, on the other hand, is a measure of the actual practical value of the diagnostic information or classification - how much

This chapter not subject to U.S. copyright Published 1988 American Chemical Society

Currie; Detection in Analytical Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1987.


benefit the test provides relative to its risks and costs. Evaluating or optimizing efficacy involves decision theory and consideration of the complexities of clinical utility, rather than just accuracy.

This is part of a symposium on detection limits. In this paper I will consider limits in terms of clinical detection rather than analytical detection. By clinical detection I mean accuracy or the discriminating ability referred to in the preceding paragraph. This ability of a test, expressed as sensitivity and specificity, is nicely described and appreciated using the receiver operating characteristic (ROC) curve because it provides a pure index of accuracy, of discrimination ability. It deals with signal detection and the ability to distinguish signal from noise. The index of accuracy provided is independent of any decision criterion which might be applied or of any bias which the system might have toward one decision or another. Thus the decision aspect, which involves costs, benefits, and outcomes, is separated out so as not to confound the assessment of the intrinsic ability of the test to discriminate among various states.

The influence of various decision factors (prevalence, utilities) on the operation and ultimate efficacy of the test is addressed by clinical decision analysis. The formal tool of clinical decision analysis joins the estimates of the probabilities of test outcomes (true positives, false positives, etc.) provided by ROC analysis with decision factors so as to establish the decision criterion for tests and to choose the set and order of diagnostic and therapeutic steps to be taken to optimize the outcome in terms of years of life, quality of life, costs, resource utilization, etc. (2-3).

The basic job of a clinical laboratory test is to provide information about the clinical state of patients for health care management purposes. The goal then is to subdivide or classify seemingly similar subjects into clinically relevant management subgroups. Suppose we are talking about people who come to an emergency room with acute chest pain. Some will turn out to be having a heart attack and some won't. Laboratory tests help divide or classify those patients into subgroups - that is, lab tests help to distinguish those who probably are having a heart attack from those who aren't.

The question is, what is the limit of the ability of the test to identify or detect subjects having a heart attack among those with chest pain? What are the limits of the test's powers to detect accurately the clinical state of each individual in the group? This is a signal detection theory issue. Most diagnostic tests are imperfect and, particularly when we use a binary approach - results are either "positive" or "negative" - there are some misclassification errors, inaccuracies. Some subjects with the condition of interest will be missed or some without the condition will be mistakenly considered affected, or both will happen. The ability of a test to properly identify or classify subjects or conditions of interest can be expressed as the sensitivity and specificity of the test. For clinical purposes these are defined as follows:

SENSITIVITY (TRUE POSITIVE RATE): Fraction of all affected


subjects in whom the test result is positive; "test positivity in the presence of the disease."

SPECIFICITY (TRUE NEGATIVE RATE): Fraction of all unaffected subjects in whom the test result is negative; "test negativity in the absence of the condition."

These inaccuracies in terms of sensitivity and specificity can be statistically represented by the ROC curve. This paper will discuss basic test performance in terms of accuracy, but will not deal with actual application of a test. The latter involves choosing decision levels (i.e., reference values, cut-offs, normal limits, etc.) and involves measures of utility which are beyond the scope of fundamental test performance. I will describe a set of principles or elements important for evaluating test performance and comparing tests to one another (4,5). I will particularly emphasize the power and convenience of ROC curves, an extremely effective tool for assessing and comparing tests (2,3,6-8). While the power and usefulness of ROC curves have been recognized and discussed by members of various biomedical disciplines in recent years, this tool has received little attention from the clinical laboratory community.

Signal/Noise Discrimination: Historical Perspectives

The ROC curve apparently had its origins in electronic signal detection theory. Much of this arose in the 1940's and 1950's from analysis of radar systems. During WWII, radar operators watched screens for blips which might indicate enemy aircraft for the purpose of deciding when to mobilize fighter squadrons to intercept. The problem was to distinguish between signals from hostile planes and noise from clouds, flocks of birds, etc. They realized that in interpreting the radar signals they saw there was always a trade-off between sensitivity and specificity: as the sensitivity increased, so did the rate of false positives. That is, if they lowered the threshold for which blips they interpreted as signifying enemy planes, they falsely identified clouds and migrating birds, etc., as planes more often. Specificity declined and they scrambled interceptor squadrons unnecessarily. On the other hand, raising the threshold for calling a blip "positive" (enemy bombers) meant not responding to the arrival of enemy aircraft in some instances (false negatives). They were experiencing the trade-off between sensitivity and specificity inherent in test systems.

Figure 1 shows hypothetical signals and noise in the form of peaks. Imagine this is radar information and the real planes give peaks I, II, and III. If interceptor planes are sent up when the signal exceeds criterion C, then two real signals, I and II, will be missed. However, if criterion A is used so as to catch all three real signals of enemy aircraft, a number of noise artifacts will be incorrectly classified as positives (false positives).


Signal/Noise Discrimination in the Clinical Laboratory

Figure 2 illustrates this in the form of serum myoglobin concentrations obtained 5 hours after the onset of chest pain from patients admitted to a coronary care unit with the suspicion of myocardial infarction. This test has been proposed by some as a marker for heart attacks. Some of these patients turned out to have a heart attack (solid bars) and some didn't (hatched bars). Because of the overlap between "signals" and "noise," any decision criterion we choose will result in some misclassifications. We could choose any of various decision levels, each giving a different sensitivity/specificity combination - all of the possible combinations comprising the trade-offs available with this test. This spectrum of trade-offs constitutes the detection limit of this test and is represented by the ROC curve.

We have defined the ability to identify affected individuals as sensitivity and the ability to recognize unaffected individuals as specificity and can express these abilities as percentages or decimal fractions. A perfect test would exhibit both a sensitivity and specificity of 100% or 1.0. Tests are rarely perfect. It would be rather unusual for a test to exhibit a sensitivity and a specificity of 100% at the same time. Often we hear or read that a particular test has a particular sensitivity or specificity. In reality, as noted with radar and serum myoglobin, there isn't just one sensitivity or specificity for a test, but rather a continuum of sensitivities and specificities.

By varying the decision level (or "decision point," "upper limit-of-normal," "cutoff value," "reference value," etc.), any sensitivity from 0 to 100% can be obtained. Each of these sensitivities will have a corresponding specificity. Sensitivity and specificity occur, then, in pairs. The test's accuracy is reflected in the pairs that can occur; not all pairs are possible for a particular test. A given test will have one set of sensitivity-specificity pairs in one clinical situation, but may have a different set of pairs when applied to another clinical situation where the group tested is different. The spectrum of pairs exhibited by a test in a given clinical setting characterizes or describes the accuracy of the test.

Often test users implicitly assume one sensitivity-specificity pair characterizes a test because they accept a conventional, often arbitrarily chosen, upper-limit-of-normal as the single correct decision level for that test for all circumstances. They accept the corresponding sensitivity-specificity pair as the correct one for the test. This, however, is actually only one of multiple possible operating points for the test. When the concept of varying the decision level (operating point) to generate a spectrum of sensitivity-specificity pairs is understood, then the issue becomes: How good are the pairs? Also, which pair(s) works the best for the circumstances in which the test is to be used?
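The way sensitivity and specificity occur in pairs as the decision level moves can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the test results are invented (they are not the myoglobin data of Figure 2), the function name `sensitivity_specificity` is hypothetical, and a result is called "positive" when it is at or above the decision level.

```python
def sensitivity_specificity(affected, unaffected, decision_level):
    """Return (sensitivity, specificity) for one decision level.

    Assumption: a result is "positive" when it is >= decision_level.
    """
    # Sensitivity: fraction of affected subjects with a positive result.
    true_positives = sum(1 for x in affected if x >= decision_level)
    # Specificity: fraction of unaffected subjects with a negative result.
    true_negatives = sum(1 for x in unaffected if x < decision_level)
    return true_positives / len(affected), true_negatives / len(unaffected)


# Hypothetical test results in arbitrary units, invented for illustration.
affected = [35, 60, 80, 120, 200]
unaffected = [10, 20, 30, 40, 70]

# One sensitivity-specificity pair per decision level.
pairs = {level: sensitivity_specificity(affected, unaffected, level)
         for level in (25, 50, 100)}

for level, (sens, spec) in pairs.items():
    print(f"decision level {level}: sensitivity {sens:.1f}, specificity {spec:.1f}")
```

With these made-up numbers, lowering the decision level raises sensitivity and lowers specificity, which is the trade-off the text describes; no single pair characterizes the test.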


Figure 1: Diagram of a hypothetical set of peaks from a radar receiver. Peaks I, II, and III represent signals from aircraft, while all other peaks represent noise. Lines A, B, C, and D represent increasing decision level thresholds, which result in successively lower true- and false-positive rates.

Figure 2: Serum myoglobin concentrations for 54 patients with chest pain admitted to a coronary care unit. Myoglobin was measured 5 hours after the onset of pain. Solid bars: acute myocardial infarct. Crosshatched bars: no infarct. (Horizontal axis: individual patients; vertical axis: serum myoglobin concentration, with ticks from 10 to 500.)

ROC Curves: Derivation

To answer these questions, we first need a way to represent and deal with all these different possible operating points and their resultant performance characteristics (sensitivity/specificity pairs). The ROC curve graphically displays the entire spectrum of a given test's performance for a particular sample group of affected and unaffected subjects. Figure 3 contains a hypothetical frequency distribution histogram at the top and the corresponding ROC curve below. The ROC curve plots the true positive (TP) rate or percentage as a function of the false positive (FP) rate or percentage as the decision level is varied. The true positive rate is the same as sensitivity and is equal to the number of affected individuals with a "positive" result divided by the total number of affected individuals. The true positive rate is also equal to 1 - false negative (FN) rate. The false positive rate is the fraction of unaffected individuals who nevertheless have a "positive" test result, and is therefore related to specificity, or the ability of the test to correctly identify unaffected individuals (specificity = true negative (TN) rate = number of unaffected individuals with "negative" results / total number of unaffected individuals = 1 - false positive rate).

Both the TP and FP rates depend on the decision level chosen. Both rates also depend on the clinical setting, as reflected by the study population chosen. The FP rate is influenced by the type of nondiseased subjects included in the study group. If, for example, the nondiseased subjects are all healthy blood donors who are free of any signs or symptoms of disease, the test may appear to have a much lower FP rate than if the nondiseased subjects are persons who clinically resemble those who actually have the disease. Like the FP rate, the TP rate also depends on the study group. A test used to detect cancer may have a higher TP rate when applied to patients who have active or advanced disease than when applied to patients having stable or limited disease. This dependence of TP and FP rates on the study population is the reason why an ROC curve must be generated for each clinical situation.

Each point on the ROC curve represents a pair of true and false positive rates corresponding to some decision level. In Figure 3, the left hand curve of the frequency histogram (top) represents results from unaffected individuals and the right hand curve is derived from affected individuals. The ROC curve is derived from the data in the frequency histogram, so the first step is to obtain the test results from both the affected group and the unaffected group. True positive rates are calculated using the results from the affected individuals, while false positive rates are generated from the unaffected individuals' data. The ROC curve is constructed by varying the decision level from the highest test result down to zero, resulting in true and false positive rates which vary continuously. The decision level at point a in Figure 3 is higher than any observed results (see top), so at that decision level none of the results are "positive" and both true and false positive rates are zero (see bottom). As the


Figure 3: Top: Hypothetical frequency distribution curve. Bottom: Receiver operating characteristic (ROC) curve corresponding to data in top panel, generated by varying the decision level and then plotting the resulting pairs of true and false positive rates. Arrows at a to e mark points corresponding to decision levels in top panel. The curve from c to d describes the test's performance in the crucial overlap region. (Horizontal axis of the ROC panel: false positive rate (%).)
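The construction summarized in the Figure 3 caption, lowering the decision level step by step and recording each resulting pair of rates, might be sketched as follows. This is a minimal illustration, not the procedure used for the chapter's figures: the data are invented, `roc_points` is a hypothetical helper, and a result counts as "positive" when it is at or above the decision level.

```python
def roc_points(affected, unaffected):
    """Return (FP rate, TP rate) pairs as the decision level is lowered.

    Assumption: a result is "positive" when it is >= the decision level.
    """
    # Every distinct observed result is used as a decision level, highest first.
    levels = sorted(set(affected) | set(unaffected), reverse=True)
    # A level above every result classifies nothing as positive (point a).
    points = [(0.0, 0.0)]
    for level in levels:
        tp_rate = sum(x >= level for x in affected) / len(affected)
        fp_rate = sum(x >= level for x in unaffected) / len(unaffected)
        points.append((fp_rate, tp_rate))
    return points


# Hypothetical test results in arbitrary units, invented for illustration.
affected = [35, 60, 80, 120, 200]
unaffected = [10, 20, 30, 40, 70]

curve = roc_points(affected, unaffected)
for fp_rate, tp_rate in curve:
    print(f"FP rate {fp_rate:.1f}, TP rate {tp_rate:.1f}")
```

For this invented data set the curve climbs the vertical axis to (0.0, 0.6), which plays the role of point c, leaves the axis through the overlap region, reaches a TP rate of 1.0 at (0.4, 1.0), the analogue of point d, and then runs along the top to (1.0, 1.0).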


decision level is lowered from a to b, some of the affected individuals have positive results but none of the unaffected individuals do, so the true positive rate rises while the false positive rate remains zero. Point c shows the highest true positive rate achievable (with this data) with the false positive rate still at zero. This is the edge of the overlap region (c to d). At c the ROC curve leaves the Y axis because if the decision level is lowered any further, some unaffected individuals have falsely positive results. At decision level d, all affected individuals have positive test results, so the true positive rate reaches 100%, at the expense of some percentage of false positives. This is the other edge of the crucial overlap region. The portion of the curve from c to d (where it has left the Y axis but not yet intercepted the true positive = 100% horizontal line) describes the overlap region. From decision level d to e, false positive rates increase as more and more results from unaffected individuals are incorrectly classified as positive.

ROC Curves: Interpretation

The complete ROC curve summarizes the clinical accuracy of the test by displaying the paired true and false positive rates for all possible decision levels. Good clinical performance of a test is characterized by a high true positive rate and a low false positive rate. Accordingly, as test performance improves, the ROC curve will move upward (toward higher true positive rates) and to the left (toward lower false positive rates). A perfect test would achieve a 100% true positive rate with no false positives. Thus, its ROC curve would rise vertically to the (0,100) point in the upper left corner and then move horizontally to the right along the horizontal line representing true positive rate = 100% to the (100,100) point in the upper right corner. Conversely, for a clinically useless test, which gives similar results for subjects with and without the condition, the true and false positive rates would be identical for any given decision level. Therefore, the ROC curve would be a diagonal between the lower left and upper right corners, representing the line where the true positive rate always equals the false positive rate.

Because the curve is usually above the diagonal, it starts out at the lower left with the TP rate (sensitivity) increasing faster than the false positive rate. At some point the slope begins to fall and the false positive rate starts increasing faster than the true positive rate - in other words, gains in sensitivity come at increasingly larger costs in terms of nonspecificity. This imposes a practical limit on the usable sensitivity of the test. Where that limit lies depends on the relative utility, or benefits and costs, of true and false results, and gets us beyond detection and into decision issues.

The ROC curve can also be constructed as a plot of true positive rate (sensitivity) versus true negative rate (specificity) instead of versus false positive rate


(1 - specificity). This produces a mirror image of the curve shown in Figure 3, flipping the curve to the right side with the perfect point being the upper right hand corner instead of the upper left hand corner. The ROC curve, then, provides a comprehensive picture of the test's accuracy at all possible operating points (decision levels). It does this without the need to choose a decision level or establish a normal range in advance.

Comparing Tests

Besides being valuable in evaluating a single test by demonstrating the complete spectrum of its intrinsic performance, the ROC curve is extremely useful in comparing tests to one another. Even if we are evaluating only a single new test, comparisons to existing tests are often inherent in the evaluation process. ROC curves provide an elegantly simple means for demonstrating the relative accuracy of multiple tests, comparing them at every TP rate by plotting the ROC curves for all the tests on the same graph. If the ROC curve for one test is uniformly above and to the left of the ROC curve for a second test, the first test will have a lower FP rate than the second test has for any given TP rate. The ROC curves of Figure 4 illustrate the ambiguity involved in comparing tests at just one decision level or operating point.

Consider the case in which test A has a TP rate of 98% and a FP rate of 30%, while test B has a TP rate of 70% and a FP rate of 2%. If the clinical performance of the two tests were equivalent, they would share a single ROC curve. This situation is illustrated in Figure 4, left. Test B could have achieved the same TP and FP rates as test A if a different decision level had been used. In fact either test could have achieved any of the pairs of TP and FP rates on the common ROC curve simply by changing the decision level. Thus, the two tests may in fact share a single ROC curve but initially appear to perform differently because the two decision levels used place the tests at different points on the curve, i.e., the operating conditions were not comparable. On the other hand, the two tests may actually perform very differently, with test B clearly superior, as illustrated in Figure 4, center. Regardless of the decision level chosen for test A, it cannot achieve a TP rate of 70% with a FP rate of only 2%, as did test B. In fact, when test A's TP rate is 70%, its FP rate is 10%. Similarly, the true- and false-positive rates given originally would be equally consistent with the situation shown in Figure 4, right, where test A is clearly superior.

These examples illustrate how the use of ROC curves avoids the ambiguity which may occur when tests are compared using only one decision level for each.
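Comparing two tests at matched TP rates, rather than at one arbitrary operating point each, can be sketched as below. The sketch assumes each curve is available as a short list of (FP rate, TP rate) operating points and does no interpolation between them. The points for test A and the 70%/2% point for test B follow the center-panel scenario described in the text; test B's FP rate of 15% at a TP rate of 98% is an invented value, and `fp_rate_at` is a hypothetical helper.

```python
def fp_rate_at(curve, target_tp):
    """Lowest FP rate among listed operating points with TP rate >= target_tp."""
    return min(fp for fp, tp in curve if tp >= target_tp)


# Operating points as (FP rate, TP rate) fractions.
# Test A: 98%/30% and 70%/10% are the values quoted in the text.
test_a = [(0.0, 0.0), (0.10, 0.70), (0.30, 0.98), (1.0, 1.0)]
# Test B: 70%/2% is from the text; the 98%/15% point is hypothetical.
test_b = [(0.0, 0.0), (0.02, 0.70), (0.15, 0.98), (1.0, 1.0)]

# Compare the two tests at the same TP rates instead of at one point each.
tp_rates = (0.70, 0.98)
comparison = {tp: (fp_rate_at(test_a, tp), fp_rate_at(test_b, tp))
              for tp in tp_rates}
b_is_better = all(fp_b <= fp_a for fp_a, fp_b in comparison.values())

for tp, (fp_a, fp_b) in comparison.items():
    print(f"TP rate {tp:.2f}: FP rate A {fp_a:.2f}, FP rate B {fp_b:.2f}")
print("Test B at or below test A's FP rate at every TP rate checked:", b_is_better)
```

Looking only at A's 98%/30% point and B's 70%/2% point would not settle which test is more accurate; matching the TP rate first is what removes the ambiguity.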

Figure 4: Hypothetical receiver operating characteristic (ROC) curves showing three possible relations between tests A and B. In each case, test A exhibits a true positive rate of 98% and a false positive rate of 30%, while test B exhibits a true positive rate of 70% and a false positive rate of 2%. Left panel: Both tests have identical ROC curves, and thus, equivalent diagnostic accuracy. Middle panel: Test B has a better ROC curve. Right panel: Test A has a better curve. (Horizontal axis of each panel: false positive rate (%).)

ROC Curves: Application

Figures 5 and 6 are examples of real ROC curves and illustrate how the ROC curve can represent individual test accuracy as well as compare the accuracy of multiple tests to one another. Four analytes were measured. Creatine kinase (CK) is a serum enzyme, found primarily in heart and other muscles, which has been used for some years as an early marker for necrosis. Peak serum concentrations usually occur within the first 12-24 hours after the onset of infarction. CK-MB is an isoenzyme of CK which is more specific for heart muscle than is total CK and thus has become popular in the last 10 years. CK-BB, another isoenzyme of CK found in the heart, has also been examined as a possible marker for myocardial infarction. Myoglobin, a heme-containing protein found in muscle, is released into the serum with muscle injury. Serum concentrations of myoglobin appear to rise earlier than CK in patients with myocardial infarction, peaking at about 8 hours after the onset of chest pain.

Figure 5 was generated by studying these four markers of myocardial injury in patients suspected of having a heart attack sampled 8 hours after the onset of chest pain. Myoglobin occupies the left-most position of the tests, and achieves the best ratio of true positives to false positives, with good absolute sensitivity (high true positive rate) and specificity (low false positive rate) simultaneously. From the ROC curve, one can make two judgements. First, myoglobin achieves the best accuracy of the four tests. Second, myoglobin probably has potential as an early marker of myocardial infarction because its ROC curve lies quite close to the ideal location, the upper left hand corner. This indicates that it can achieve high true positive and low false positive rates at the same time.

How best to use this test clinically and which decision level (i.e., where on the ROC curve to operate) to select requires clinical decision analysis with consideration of the costs of false results, the alternative tests or procedures available, the costs of the alternatives, and the utilities of the various possible outcomes (2,3). The ROC curve displays the spectrum of sensitivity/specificity pairs achievable; these pairs are the raw data needed to make the selection of decision level.

In Figure 6, the patients are sampled at 18 hours after the onset of chest pain. Myoglobin's accuracy has decreased while that of the three other tests has markedly increased to a close-to-perfect level. This reflects the fact that the serum concentration of myoglobin is not increased as much or as often at 18 hours compared to 8 hours after the onset of pain. Therefore, it is not as good at discriminating between patients having and not having an infarction. CK and its isoenzymes, on the other hand, are near peak concentrations in those patients with infarcts and lower in those without infarcts, and thus are very accurate in discriminating. In this study, the "true" diagnosis or gold standard was established by review of electrocardiographic data, clinical course, and serum lactate dehydrogenase isoenzymes, as

Currie; Detection in Analytical Chemistry ACS Symposium Series; American Chemical Society: Washington, DC, 1987.

[Figure 5: ROC curves of 4 serum tests 8 hours after the onset of chest pain in patients suspected of having a myocardial infarction. Horizontal axis: false positive rate (%). CK = creatine kinase; CK-BB = "brain" isoenzyme of CK; CK-MB = "myocardial" isoenzyme of CK.]


[Figure 6: ROC curves of 4 serum tests (myoglobin, CK-BB, CK-MB, total CK) 18 hours after the onset of chest pain in patients suspected of having a myocardial infarction. Horizontal axis: false positive rate (%). Abbreviations are the same as for Figure 5.]


well as scintigraphic findings where available. To avoid introduction of bias, the classification of patients was made without consideration of the results of any of the four tests being evaluated. A given study provides an estimate of the ROC curve for that test and patient population. The confidence limits around the ROC curve can be calculated (8,9). Furthermore, the area under the ROC curve can be calculated for each test so as to derive a quantitative index of the test's individual accuracy and its relation to the other tests being evaluated (8,9).

ROC curves can also be used to examine the impact of analytical improvements on clinical accuracy. Figure 7 shows distributions of test results and the corresponding ROC curves, based on simulated data. For both the affected and the unaffected patients, the biological variability of the marker being measured has a standard deviation (SD) of 1 unit. The affected patients have a mean test result of 20, while the unaffected patients have a mean test result of 12. When the analytical imprecision has an SD of 4 units, there is considerable overlap in test results between the affected and unaffected patients. The corresponding ROC curve shows the poor clinical performance of the test. If an improved analytical system reduces the imprecision of the measurement from 4 to 2 units, the overlap in test results is considerably reduced. The dramatic shift of the ROC curve upward and to the left reflects the improved clinical performance of the test. If the analytical imprecision is again halved, reducing its standard deviation from 2 to 1, another significant improvement in clinical performance occurs.
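The effect of analytical imprecision on the area under the ROC curve can be sketched numerically. The short Python example below is only an illustration and is not the simulation used for Figure 7. It assumes that test results in each group are normally distributed, that biological and analytical variation are independent so their variances add, and that both groups share the same total SD. Under those assumptions the area under the curve is Φ(Δ / (σ√2)), where Δ is the difference in group means, σ is the total SD, and Φ is the standard normal cumulative distribution. The function name `binormal_auc` is hypothetical.

```python
import math


def binormal_auc(mean_affected, mean_unaffected, biological_sd, analytical_sd):
    """Area under the ROC curve for two normal groups with equal total SD.

    Assumption (not from the chapter): biological and analytical variation
    are independent, so the total SD is the root sum of squares of the two.
    """
    total_sd = math.hypot(biological_sd, analytical_sd)
    z = (mean_affected - mean_unaffected) / (total_sd * math.sqrt(2))
    # Standard normal cumulative distribution evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


# Means of 20 and 12 and a biological SD of 1, as described for Figure 7.
figure7_auc = {sd: binormal_auc(20, 12, 1, sd) for sd in (4, 2, 1)}

for analytical_sd, auc in figure7_auc.items():
    print(f"analytical SD = {analytical_sd}: area under ROC curve = {auc:.4f}")
```

Changing the `biological_sd` argument lets the same sketch explore cases in which the intrinsic overlap between the groups is larger.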
In this example, in which the biological overlap between the two groups of patients was small, the precision of the analytical system became the principal factor in determining the clinical performance of the test; substantial improvements in clinical accuracy occurred as the analytical precision improved. In contrast, Figure 8 shows the situation in which the biological overlap is greater. In this example, the biological variation in each group has an SD of 4 units, resulting in considerable intrinsic overlap in the test results of the two groups. The figure shows this extensive overlap and the poor ROC curve for an analytical SD of 4. Decreasing the analytical imprecision (from 4 to 2 to 1) provides only a minor improvement in clinical accuracy. Thus, when the biological overlap of the two groups is large, even severalfold improvements in analytical precision may have little effect on the clinical accuracy of the test, as reflected in the ROC curve.

Principles of Test Evaluation

Once we have the basic performance data describing detection and the limits of detection as represented by the ROC curve, then we can go on to decision analysis. This involves structuring the clinical problem in the form of a decision tree, estimating utilities and costs of various outcomes, choosing decision levels


[Figure 7. Frequency distribution and corresponding ROC curves for a test having a biological SD of 1 and analytical SD of 1, 2, or 4. The two peaks in the frequency distributions represent test results from diseased and nondiseased populations. Horizontal axis of the distributions: test result.]

[Figure 8. Frequency distribution and corresponding ROC curves for a test having a biological SD of 4. Other features same as Figure 7.]


for using the test. However, obtaining valid data for the ROC curve in the first place requires attention to several common-sense principles which are surprisingly often overlooked. Table I has a list or recipe for designing a good study to evaluate the detection power of a test. The ideal study is prospective and is usually harder, longer and more expensive than the type of evaluation commonly done, but an "inexpensive" clinical evaluation may prove more costly in the long run if its erroneous conclusions lead to improper test utilization or improper patient management.

Table I. Principles of a Good Evaluation of a Laboratory Test

1. DEFINE CLINICAL QUESTION TEST WILL BE USED FOR
2. SELECT APPROPRIATE SUBJECTS TO STUDY
3. CLASSIFY SUBJECTS ACCURATELY AND INDEPENDENTLY
4. PERFORM ALL TESTS ON ALL SUBJECTS
5. EVALUATE AND COMPARE TESTS USING ROC CURVES

The first and most important element on this list is defining specifically and carefully the clinical question or problem at which the test is to be directed. It's not enough to say "Let's look at this test for prostatic cancer or coronary artery disease and see how well it does." We need to define precisely what question of relevance to patient management is being addressed and how that test will be used in practice. Do we want to screen large numbers of people for cancer or use the test to establish the stage of cancer once we know it's there, or do we want to predict response to a particular therapy, or assess response to a particular therapy? It may provide all these functions but with varying effectiveness and requiring differing decision levels. Each of these roles must be evaluated separately because the populations are different, conditions are different, goals are different, and ROC curves may be different. If you think about these issues, carefully and specifically defining what you are trying to establish, the rest starts falling into place.

The second element is selecting appropriate subjects. Once you have defined the question, you've pointed the way toward the proper subjects. If you want to use a tumor marker to identify colon cancer among middle-aged people with bowel obstruction, occult blood loss, or unexplained anemia, then you need to look at the test performance in that group of subjects. Healthy young people aren't relevant and neither is a reference range based on them. There's no point in doing conventional normal ranges if healthy young volunteers aren't the ones for whom the test is intended.
Number three concerns establishing the true diagnosis. Once you've got a group of people with bowel signs or symptoms suggesting that cancer is possible, then you must separate them into 2 groups, those who really do have carcinoma of the colon and those who don't. This provides a gold standard for calculation of


TP rates, FP rates, etc. This diagnosis needs to be accurate as well as independent of all tests being evaluated. To the extent that either accuracy or independence is lacking, the results of the evaluation will be biased and misleading.

Consider the hypothetical situation in Figure 9. The clinical question is, "Has this patient presenting at the emergency room with an acute psychiatric disorder used marijuana recently?" The routine test is sensitive enough to detect only 70% of the recent drug users; 30% of the marijuana users have falsely negative results. The routine test also suffers from various interferences, leading to false positive results in 30% of non-users. Test I represents a new test which is being evaluated. In actuality it manifests excellent sensitivity and specificity, giving positive results in all recent marijuana users and negative results in all non-users. If, however, instead of independently and accurately determining the drug-use status of each patient, the patients are simply classed as users or non-users on the basis of the routine test's results, Test I will appear to perform poorly, misclassifying 30% of the patients.
In this case a perfect test appears to perform poorly simply because the clinical question was not answered accurately for each patient; i.e., the "gold standard" used for comparison was inadequate. The opposite bias can also result from use of inadequate gold standards. Test II in Figure 9 performs even more poorly than the routine test, yielding false negative results in 40% of the marijuana users and false positive results in 40% of the non-users. If, however, the routine test's results are accepted as correct and Test II is judged on this basis, Test II will appear to misclassify only 10% of the patients, and will have a better apparent performance than Test I!

This can occur in several ways in clinical practice. In evaluating a test for acute myocardial infarction, if the patients are classified on the basis of EKG data alone or even a combination of history, EKG findings and some cardiac enzyme results (a "routine workup"), the diagnosis may still be inaccurate and, thus, distort the apparent performance of the new test. In the case of a cancer tumor marker, if the gold standard (diagnosis or staging, etc.) is based upon clinical findings rather than surgical and/or tissue data, then the gold standard may be inaccurate and bias the apparent value of the marker.
If an amniotic fluid marker for fetal lung maturity is compared to an existing imperfect marker, then even if the new marker is perfect, it will appear imperfect. The gold standard against which the new marker should be compared is the actual presence or absence of respiratory distress syndrome in those newborns delivered within a short time of measurement of the marker. Because the validity of a clinical evaluation's conclusions is critically dependent on the accurate determination of the answer to the clinical question for each subject, routine clinical diagnoses are likely to be inadequate for test evaluation studies. Definitive determination of a patient's true clinical subgroup may require such procedures as biopsy, surgical exploration, autopsy examination, angiography, or long-term follow-up of response to therapy and clinical outcome.
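The arithmetic behind the marijuana example of Figure 9 can be checked by simple counting. The Python sketch below is a hypothetical reconstruction, not the author's calculation: it assumes 100 patients in each subgroup, and it assumes that the patients misclassified by the routine test are a subset of those misclassified by Test II, which is the arrangement that reproduces the 10% apparent error quoted above. The helper names `build_results` and `disagreement` are illustrative only.

```python
N_PER_GROUP = 100  # assumed group size; the chapter gives only percentages


def build_results(n_wrong):
    """Results (True = positive) for users followed by non-users.

    The first n_wrong patients in each subgroup are misclassified, so
    tests with more errors repeat the errors of tests with fewer
    (an assumption made here to match the figures quoted in the text).
    """
    users = [i >= n_wrong for i in range(N_PER_GROUP)]     # false negatives first
    non_users = [i < n_wrong for i in range(N_PER_GROUP)]  # false positives first
    return users + non_users


def disagreement(test, reference):
    """Fraction of patients for whom the test and the reference differ."""
    return sum(t != r for t, r in zip(test, reference)) / len(reference)


truth = build_results(0)      # actual drug-use status of every patient
routine = build_results(30)   # routine test: wrong in 30% of each subgroup
test_i = build_results(0)     # Test I: correct in every patient
test_ii = build_results(40)   # Test II: wrong in 40% of each subgroup

for name, test in (("Test I", test_i), ("Test II", test_ii)):
    print(
        f"{name}: true error rate {disagreement(test, truth):.0%}, "
        f"apparent error rate against routine test {disagreement(test, routine):.0%}"
    )
```

Judged against the truth, Test I is the better test; judged against the routine test, the ranking is reversed, which is the bias described above.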


[Figure 9: Hypothetical performances of three tests (routine test, Test I, Test II) for marijuana use in two subgroups of patients, one which has used marijuana recently and one which has not. Assumes that the routine test gives correct results in 70% of subjects. FN = false negative results; TP = true positive results; FP = false positive results; TN = true negative results.]


The next item is performing all tests being evaluated on all the subjects being used. This may sound reasonable but is very often overlooked. If the specimens or subjects aren't identical for all tests being examined, observed differences in test performance could simply be reflections of differences in the subjects rather than true differences in performance.

The last element is evaluating and comparing tests using ROC curves, extensively discussed above. The ROC analysis is a powerful tool which provides a pure index of accuracy, of discrimination capability, clearly describing the limits of clinical detection possible for a given test in a given clinical setting. Adherence to the recipe in Table I, including ROC analysis, should maximize the likelihood of obtaining a valid assessment of laboratory test accuracy.
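As an illustration of the last two elements, the following Python sketch builds empirical ROC curves for two tests measured on the same subjects and compares their areas by the trapezoidal rule. It is a minimal example under stated assumptions, not a method taken from the chapter: the ten subjects and their results for `marker_a` and `marker_b` are invented, higher values are assumed to indicate disease, and the function names are hypothetical. Confidence limits, discussed in references 8 and 9, are not computed here.

```python
def roc_points(values, diseased):
    """(false positive rate, true positive rate) at every decision level.

    A result is called positive when it is at or above the decision level.
    """
    n_pos = sum(diseased)
    n_neg = len(diseased) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(values), reverse=True):
        tp = sum(v >= cutoff and d for v, d in zip(values, diseased))
        fp = sum(v >= cutoff and not d for v, d in zip(values, diseased))
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of ROC points ordered from (0,0) to (1,1)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented data: the same 10 subjects, classified independently of both tests.
diseased = [True] * 5 + [False] * 5
marker_a = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]  # separates the groups completely
marker_b = [9, 4, 7, 2, 5, 8, 3, 6, 1, 0]  # overlaps between the groups

auc_a = area_under_curve(roc_points(marker_a, diseased))
auc_b = area_under_curve(roc_points(marker_b, diseased))
print(f"marker_a area = {auc_a:.2f}, marker_b area = {auc_b:.2f}")
```

Because both markers are evaluated on identical subjects against the same classification, a difference in area reflects the tests rather than the subjects.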

Literature Cited

1. Swets, J.A.; Pickett, R.M. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory; Academic Press: New York, 1982; Chapter 1.
2. McNeil, B.J.; Keeler, E.; Adelstein, S.J. N. Engl. J. Med. 1975, 293, 211-215.
3. Weinstein, C.; Feinberg, H.V. Clinical Decision Analysis; W.B. Saunders Co.: Philadelphia, 1980.
4. Zweig, M.H.; Robertson, E.A. Clin. Chem. 1982, 28, 1272-1276.
5. Robertson, E.A.; Zweig, M.H.; Van Steirteghem, A.C. Amer. J. Clin. Pathol. 1983, 79, 78-86.
6. Metz, C.E. Semin. Nucl. Med. 1978, 8, 283-298.
7. Turner, D.A. J. Nucl. Med. 1978, 19, 213-220.
8. Beck, J.R.; Shultz, E.K. Arch. Pathol. Lab. Med. 1986, 110, 13-20.
9. McNeil, B.J.; Hanley, J.A. Med. Dec. Making 1984, 4, 137-150.

RECEIVED December 24, 1986
