ROC Analysis
ROC methodology is appropriate in situations where there are 2
possible "truth states" (e.g., diseased/normal,
event/non-event, or some other binary outcome), "truth"
is known for each case, and "truth" is determined
independently of the diagnostic tests, predictor variables,
etc. under study.
In this subdirectory, you will find a number of programs (mostly
in FORTRAN) used in ROC analysis. They are briefly described
below, along with general guidelines to help you decide which
program is most appropriate for your data. First, some basic
terminology to help you make your decision (note that "disease"
can be replaced with "condition" or "event"):
Rating data vs Continuous data
The term "rating data" is used to describe data
based on an ordinal scale. For example, it is common in radiology
studies to use a 5-point scale such as 1=disease definitely
absent, 2=disease probably absent, 3=disease possibly present,
4=disease probably present, 5=disease definitely present.
"Continuous data" refers to either truly continuous
measurements or "percent confidence" scores (0-100).
Interpreting the Area Under the ROC Curve (AUC)
The area under the ROC curve (AUC) is commonly used as a summary
measure of diagnostic accuracy. It can take values from 0.0
to 1.0. The AUC can be interpreted as the probability that
a randomly selected diseased case (or "event") will
be regarded with greater suspicion (in terms of its rating
or continuous measurement) than a randomly selected nondiseased
case (or "non-event"). So, for example, in a study
involving rating data, an AUC of 0.84 implies that there is
an 84% likelihood that a randomly selected diseased case will
receive a more-suspicious (higher) rating than a randomly
selected nondiseased case. Note that an AUC of 0.50 means
that the diagnostic accuracy in question is equivalent to
that which would be obtained by flipping a coin (i.e., random
chance). It is possible but not common to run into AUCs less
than 0.50. It is often informative to report a 95% confidence
interval for a single AUC in order to determine whether the
lower endpoint is > 0.50 (i.e., whether the diagnostic
accuracy in question is, with some certainty, any better than
random chance).
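The probability interpretation above can be checked directly on a small data set. The following is only a minimal sketch in Python (not one of the programs in this subdirectory): it counts, over all diseased/nondiseased pairs, how often the diseased case gets the higher rating (ties count as one half). The confidence interval uses the Hanley-McNeil (1982) standard-error approximation, which is an assumption made here for illustration; the function names and the example ratings are hypothetical.

```python
import math

def auc_pairwise(diseased, nondiseased):
    """Fraction of (diseased, nondiseased) pairs in which the diseased
    case is rated higher; tied pairs count as one half."""
    wins = 0.0
    for d in diseased:
        for n in nondiseased:
            if d > n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(diseased) * len(nondiseased))

def hanley_mcneil_ci(auc, n_dis, n_non, z=1.96):
    """Approximate confidence interval for a single AUC using the
    Hanley-McNeil standard error (an assumed, illustrative choice)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_dis - 1) * (q1 - auc ** 2)
           + (n_non - 1) * (q2 - auc ** 2)) / (n_dis * n_non)
    se = math.sqrt(var)
    return max(0.0, auc - z * se), min(1.0, auc + z * se)

# Hypothetical 5-point ratings (5 = disease definitely present).
diseased = [3, 4, 5, 5, 2, 4]
nondiseased = [1, 2, 3, 1, 2, 4]

auc = auc_pairwise(diseased, nondiseased)
lower, upper = hanley_mcneil_ci(auc, len(diseased), len(nondiseased))
print(f"AUC = {auc:.3f}, approx. 95% CI = ({lower:.3f}, {upper:.3f})")
```

With samples this small the interval is wide, so whether the lower endpoint exceeds 0.50 depends heavily on the number of cases.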
Designing an ROC study: Which scale to use?
While ordinal (1-5) rating scales are probably the most widely
used in radiology studies, there are advantages to using "percent
confidence" (0-100) scales. (Of course, if you are dealing
with a continuous measurement, you don't have to worry about
which scale to use.) For continuous data, nonparametric methods
are quite reasonable. With rating data, parametric methods
are recommended, as nonparametric methods will be biased (i.e.,
tend to underestimate the true AUC). The standard error of
the estimated area under the ROC curve is smaller when a
continuous scale is used.
Parametric vs Nonparametric methodology
"Parametric" methodology refers to inference (MLEs)
based on the binormal model (i.e., the estimate
assumes one normal distribution for cases with the disease
and one normal distribution for cases without, or that the
data can be monotonically transformed to normal). When this
assumption is true, the MLE is asymptotically unbiased.
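As a rough illustration of what the two-normal-distribution assumption implies, the sketch below writes the smooth ROC curve and its area in the usual (a, b) parameterization, where a = (mean_diseased - mean_nondiseased) / sd_diseased and b = sd_nondiseased / sd_diseased. This is a sketch under those stated assumptions, not the fitting code used by the FORTRAN programs (which estimate a and b by maximum likelihood); the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1

def binormal_tpf(fpf, a, b):
    """True-positive fraction on the smooth binormal ROC curve at a given
    false-positive fraction (fpf must lie strictly between 0 and 1)."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpf))

def binormal_auc(a, b):
    """Area under the full binormal ROC curve."""
    return _STD_NORMAL.cdf(a / (1.0 + b * b) ** 0.5)

# Example: diseased mean 1.4 SDs above nondiseased mean, equal variances.
print(round(binormal_auc(1.4, 1.0), 3))
print(round(binormal_tpf(0.25, 1.4, 1.0), 3))
```

When a = 0 the two distributions coincide and the area is 0.50, i.e., the chance line.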
"Nonparametric"
refers to inference based on the trapezoidal-rule area (which is
equal to the Wilcoxon estimate of the area under the ROC curve,
which in turn is equal to the "c"-statistic in SAS
PROC LOGISTIC output). Nonparametric estimates of the area
under the ROC curve (AUC) tend to underestimate the "smooth
curve" area (i.e., parametric estimates), but this bias
is negligible for continuous data.
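To make the trapezoidal rule concrete, here is a minimal Python sketch (not DELONG.FOR itself) that builds the empirical ROC points from a set of ratings and sums the trapezoids between them. The function names and the ratings are hypothetical. Because straight lines join the few operating points of a 5-point scale, the area falls below that of a smooth curve through the same points, which is the source of the bias mentioned above.

```python
def empirical_roc(diseased, nondiseased):
    """Empirical ROC points (FPF, TPF), calling a case positive when its
    score is >= each observed threshold, from strictest to most lax."""
    thresholds = sorted(set(diseased) | set(nondiseased), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpf = sum(d >= t for d in diseased) / len(diseased)
        fpf = sum(n >= t for n in nondiseased) / len(nondiseased)
        points.append((fpf, tpf))
    return points  # the last point is (1.0, 1.0)

def trapezoidal_auc(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical 5-point ratings (5 = disease definitely present).
diseased = [3, 4, 5, 5, 2, 4]
nondiseased = [1, 2, 3, 1, 2, 4]

points = empirical_roc(diseased, nondiseased)
area = trapezoidal_auc(points)
print(points)
print(round(area, 4))
```

For these ratings the trapezoidal area equals 30.5/36, the same value obtained by counting diseased/nondiseased pairs with ties scored as one half.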
Recommendations
For rating data, try a parametric method first. The bias inherent
in the nonparametric method might be problematic. If the data
are sparse (i.e., nondiseased patients and diseased patients
tend to be rated at opposite ends of the scale), then parametric
methods may not work well. Using the nonparametric approach
is an option in these cases, but may provide even more biased
results than it normally would.
For continuous data, either the parametric or nonparametric approach
is fine.
Correlated data
"Correlated data" refers to multiple observations
obtained from the same "region of interest" (ROI).
For example, in a study of appendicitis screening, each patient
may be imaged by two different "modalities" (e.g.,
plain X-ray film versus digitized images). So, then, there
will be 2 different images (plain film vs digitized image)
of Patient X's appendix, and each image will be assigned a
separate rating. Therefore, a single ROI (Patient X's appendix)
yields 2 observations (1 from each imaging modality). When
comparing the accuracy (AUC) of plain film to that of digitized
imaging, we must take into account the fact that the 2 AUCs
are correlated because they are based on the same sample of
cases.
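One way to account for this correlation is the nonparametric approach of DeLong et al. (Biometrics, 1988). The sketch below is a simplified Python illustration for two modalities read on the same cases, not a transcription of DELONG.FOR; the function names and the ratings are hypothetical, and the cases must be listed in the same order for both modalities.

```python
import math
from statistics import NormalDist

def _psi(x, y):
    """Pair kernel: 1 if diseased score is higher, 1/2 if tied, else 0."""
    return 1.0 if x > y else (0.5 if x == y else 0.0)

def _components(dis, non):
    """AUC plus per-case components for diseased (v10) and nondiseased (v01)."""
    m, n = len(dis), len(non)
    v10 = [sum(_psi(x, y) for y in non) / n for x in dis]
    v01 = [sum(_psi(x, y) for x in dis) / m for y in non]
    return sum(v10) / m, v10, v01

def _cov(u, v, mean_u, mean_v):
    """Sample covariance of two equally long lists about the given means."""
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (len(u) - 1)

def delong_paired(dis_a, non_a, dis_b, non_b):
    """Compare two correlated AUCs; returns (auc_a, auc_b, z, two-sided p)."""
    if len(dis_a) != len(dis_b) or len(non_a) != len(non_b):
        raise ValueError("both modalities must rate the same cases")
    m, n = len(dis_a), len(non_a)
    auc_a, v10a, v01a = _components(dis_a, non_a)
    auc_b, v10b, v01b = _components(dis_b, non_b)
    var_a = _cov(v10a, v10a, auc_a, auc_a) / m + _cov(v01a, v01a, auc_a, auc_a) / n
    var_b = _cov(v10b, v10b, auc_b, auc_b) / m + _cov(v01b, v01b, auc_b, auc_b) / n
    cov_ab = _cov(v10a, v10b, auc_a, auc_b) / m + _cov(v01a, v01b, auc_a, auc_b) / n
    var_diff = var_a + var_b - 2.0 * cov_ab
    if var_diff <= 0.0:
        raise ValueError("variance of the AUC difference is not positive")
    z = (auc_a - auc_b) / math.sqrt(var_diff)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return auc_a, auc_b, z, p

# Hypothetical ratings: modality A (plain film) and B (digitized), same patients.
dis_a, non_a = [3, 4, 5, 5, 2, 4], [1, 2, 3, 1, 2, 4]
dis_b, non_b = [2, 4, 4, 5, 3, 3], [2, 2, 3, 1, 3, 4]

auc_a, auc_b, z, p = delong_paired(dis_a, non_a, dis_b, non_b)
print(f"AUC A = {auc_a:.3f}, AUC B = {auc_b:.3f}, z = {z:.2f}, p = {p:.3f}")
```

The covariance term is what distinguishes this from a test that treats the two AUCs as independent; ignoring it would usually overstate the variance of the difference when the modalities are positively correlated.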
Clustered data
"Clustered data" refers to situations in which (one
or more) patients have two or more "regions of interest"
(ROIs), each of which contributes a separate measurement.
For example, in a brain study, measurements may be obtained
from the left and right hemispheres in each patient. In a
mammography study, each breast image may be subdivided into
5 ROIs, and thus 10 separate ratings may be obtained for each
patient. In such cases, it is important to account for intrapatient
correlation between measurements obtained on ROIs within the
same patient.
Of course, it is possible to have data that are both clustered
and correlated. For example, in the mammography example mentioned
above, if each breast is imaged in 2 modalities, then there
are 10 ROIs per patient, and 2 ratings per ROI. A comparison
of the accuracies (AUCs) between the two modalities will have
to take into account (a) the fact that the 10 ratings from
each patient (for each modality) are correlated, and (b) the
fact that the 2 ratings from the 2 modalities (for each ROI)
are correlated.
One reader versus multiple readers
The preceding discussion assumes either (a) that the underlying
variable is a measurement or rating obtained from only a single
source (i.e., one reader) or (b) that AUCs will only be reported
*separately* for each reader. If it is desired that the AUCs
of multiple readers (e.g., 3 doctors independently assigning
ratings) be averaged in order to arrive at one overall "average
AUC" per modality, then special methods must be used
to handle inter-rater correlation (correlation between the
ratings of different readers due to the fact that the same
set of cases is rated by each reader).
Partial ROC area
In some cases, rather than looking at the area under the entire
ROC curve, it is more helpful to look at the area under only
a portion of the curve, for example, within a certain range
of false-positive (FP) rates (i.e., restricted to a portion
of the X-axis). Often, interest does not lie in the entire
range of FP rates, and consequently, only part of the area
under the curve is relevant. For example, if we know ahead
of time that a particular diagnostic test would not be useful
if its FP rate is greater than 0.25, we might want to restrict
our attention to that portion of the ROC curve where FP rates
are less than or equal to 0.25.
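As a minimal nonparametric sketch of this idea (PARTAREA.FOR, listed below, uses a parametric approach instead), the Python function here integrates an empirical ROC curve only up to a chosen maximum FP rate, interpolating linearly where the cutoff falls between two operating points. The function name and the ROC points are hypothetical.

```python
def partial_auc(points, fpf_max):
    """Trapezoidal area under the ROC points (FPF, TPF), restricted to
    FPF <= fpf_max. Points must be sorted by increasing FPF."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= fpf_max:
            break
        if x1 > fpf_max:
            # Cut the segment at fpf_max by linear interpolation.
            y1 = y0 + (y1 - y0) * (fpf_max - x0) / (x1 - x0)
            x1 = fpf_max
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical empirical ROC points from a 5-point rating scale.
points = [(0.0, 0.0), (0.0, 1 / 3), (1 / 6, 2 / 3),
          (1 / 3, 5 / 6), (2 / 3, 1.0), (1.0, 1.0)]

pauc = partial_auc(points, 0.25)
print(round(pauc, 4))         # area for FP rates <= 0.25
print(round(pauc / 0.25, 4))  # rescaled by the maximum possible area, 0.25
```

Note that a partial area is bounded above by the width of the FP-rate range (0.25 here), so it is not directly comparable to a full AUC unless it is rescaled.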
Another possible reason to analyze the partial AUC rather than the
entire AUC was discussed by Dwyer (Radiology 1997;202:621-625):
"A major drawback to the area below the entire ROC plot as
an index of performance is its global nature. Its summarization
of the entire ROC plot fails to consider the plot as a composite
of different segments with different diagnostic implications.
ROC plots that cross may have similar total areas but differ
in their diagnostic efficacy in specific diagnostic situations.
Prominent differences between ROC plots in specific regions
may be muted or reversed when the total area is considered.
Moreover, plots with different total areas may be similar
in specific regions...A solution to the problems of a global
assessment of the entire ROC plot is the assessment of specific
regions...on the ROC plot."
Which program should you use?
- For ROC sample size calculations:
    ROCPOWER.SAS (1 reader; 1 or 2 ROC curves)
      (for help, look at ROCPOWER_HELP.TXT and/or ROCPOWER_DOC.WPD)
    DESIGNROC.FOR (for partial ROC area or SENS at fixed FPR)
      (for help, look at DESIGNROC_HELP.TXT)
    MULTIREADER_POWER.SAS (if multiple readers)
- For plotting ROC curves using S-PLUS:
    rocPlot.s (written by Hemant Ishwaran, PhD, Cleveland Clinic)
      (for this program, please refer to: http://www.bio.ri.ccf.org/Resume/Pages/Ishwaran/rocPlot.s)
  For plotting ROC curves using SAS/Graph:
    CREATE_ROC.SAS (for help, look at CREATEROC_HELP.TXT)
- For inference on *partial* ROC area:
    PARTAREA.FOR (for help, look at PARTAREA_HELP.TXT)
    PARTAREA.FOR uses a parametric approach to estimate partial AUC. Margaret
    Pepe has developed Stata software to implement a nonparametric
    method of estimating partial (or full) AUC. This program
    can be obtained from:
    http://www.fhcrc.org/labs/pepe/book/
    (Click on the Programs link and download: aucbs.ado and aucbs.hlp)
    Reference: Dodd LE, Pepe MS. Partial AUC estimation and regression.
    Biometrics 2003;59(3):614-623.
- For inference on the area(s) under one or more ROC curves:
  (SINGLE READER; 1-2 "MODALITIES")
  (A) Is data clustered?
        If Y ===> Correlated data?
        If N ===> If rating data, skip to (B)
                  If continuous data, skip to (C)
  (B) (For rating data) Would you prefer a parametric or nonparametric method?
        If parametric: Is data correlated?
          Y ===> CORROC2.F
          N ===> ROCFIT.F (for 1 curve) or INDROC.F (for 2 curves)
        If nonparametric: Is data correlated?
          Y ===> DELONG.FOR
          N ===> DELONG.FOR
        NOTE: I have also included in this subdirectory a SAS program,
        ROC_MASTERPIECE.SAS, which computes nonparametric estimates of ROC area
        for 2 or more (possibly) correlated modalities and performs a global
        chi-square test comparing all the AUCs.
  (C) (For continuous data) Would you prefer a parametric or nonparametric method?
        If parametric: Is data correlated?
          Y ===> CLABROC
          N ===> LABROC4
        If nonparametric: Is data correlated?
          Y ===> DELONG.FOR
          N ===> DELONG.FOR
        NOTE: I have also included in this subdirectory a SAS program,
        ROC_MASTERPIECE.SAS, which computes nonparametric estimates of ROC area
        for 2 or more (possibly) correlated modalities and performs a global
        chi-square test comparing all the AUCs.
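For readers who prefer to see the single-reader decision steps in one place, they can be restated as a small lookup function. This Python sketch is only a convenience restatement of the listing above; the function name and argument names are hypothetical, and no program is returned for clustered data because the listing names none on that branch.

```python
def suggest_program(data_type, parametric, correlated, n_curves=1, clustered=False):
    """Return the program named in the single-reader listing above.

    data_type  : "rating" or "continuous"
    parametric : True for a parametric method, False for nonparametric
    correlated : True if the modalities were applied to the same cases
    n_curves   : 1 or 2 ROC curves (matters only for uncorrelated rating data)
    clustered  : True if patients contribute more than one region of interest
    """
    if data_type not in ("rating", "continuous"):
        raise ValueError("data_type must be 'rating' or 'continuous'")
    if clustered:
        return None  # the listing does not name a program on this branch
    if not parametric:
        return "DELONG.FOR"  # same answer whether or not data are correlated
    if data_type == "rating":
        if correlated:
            return "CORROC2.F"
        return "ROCFIT.F" if n_curves == 1 else "INDROC.F"
    return "CLABROC" if correlated else "LABROC4"

print(suggest_program("rating", parametric=True, correlated=True))
print(suggest_program("continuous", parametric=True, correlated=False))
```

ROC_MASTERPIECE.SAS remains an additional option for nonparametric comparisons of 2 or more modalities, as the notes above describe.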
- For
inference on the area(s) under one or more ROC curves:
(MULTIPLE READERS and/or OTHER COVARIATES)
(for rating or continuous data) ===> OBUMRM.FOR
(for help, see INSTRUCTIONS.TXT; also see 3 sets of example
input and output files: FORMACONT.DAT, FORMACONT_OUT.DAT,
FORMAORDIN.DAT, FORMAORDIN_OUT.DAT, FORMB.DAT, and FORMB_OUT.DAT;
also on the Web at OBUMRM.html)
This is a FORTRAN program
written by Nancy Obuchowski which implements the Obuchowski-Rockette
method for analyzing multireader and multimodality ROC
data. This program does not handle clustered data (i.e.,
multiple observations per patient). The program produces
nonparametric estimates of ROC area, but allows for
the comparison of user-input parametric AUCs, partial
AUCs, or sensitivities at a fixed FPR (see "Format B"
for examples).
(for rating or continuous data) ===> MULTIVARIATEROC.S
This is an S-PLUS program written by Hemant Ishwaran which
uses the nonparametric method of DeLong et al. (Biometrics,
1988) and allows for multiple observations per patient
(e.g., ratings from multiple readers). It does not allow
for multiple readers *and* multiple modalities. This program
does not handle missing data. This program does not compute
an overall average ROC area among readers; it
does provide a global chi-square test comparing the ROC
areas of the readers and allows for linear contrasts of
selected readers (e.g., '1, 2, and 3 vs 4 and 5'; '4 vs
5'). To access this program, go to: multivariateRoc.s
(for rating or continuous data) ===> LABMRMC
This program uses the Dorfman-Berbaum-Metz (DBM) algorithm
to compare multiple readers and multiple treatments. The
program uses jackknifing and ANOVA techniques to test
the statistical significance of the differences between
treatments and between readers. LABMRMC can be downloaded
from the website of the Kurt Rossmann Laboratories at the
Univ. of Chicago at: http://xray.bsd.uchicago.edu/krl/roc_soft.htm
LABMRMC is suitable for use with subjective probability-rating
scales (0-100); outcomes for each reader-treatment combination
are categorized in an attempt to produce an appropriate
spread of operating points on each ROC curve, followed
by application of the DBM procedure. (Note that all of
Dr. Metz's ROC software is available in three different
formats: IBM-compatible, Mac-compatible, and generic text-only
source code for UNIX/VMS environments.) (Documentation
for LABMRMC can be obtained from Dr. Metz's website or
in this directory in the file LABMRMC.DOC)
(for rating data) ===> MRMC
The MRMC (Dorfman, Berbaum,
Abu-Dagga, and Schartz, ftp://perception.radiology.uiowa.edu/mrmc32/)
software implements the procedure for discrete rating
data (up to 20 rating categories) as described in Dorfman,
Berbaum, and Metz (1992). This program uses the Dorfman-Berbaum-Metz
algorithm to compare multiple readers and multiple treatments.
LINKS TO ADDITIONAL ROC SOFTWARE (degenerate data; Bayesian ROC
methods; LROC analysis; verification bias; etc.)
- http://xray.bsd.uchicago.edu/krl/roc_soft.htm
===> the ROC site of the Kurt Rossmann Laboratories at
the Univ. of Chicago (download ROCKIT, LABMRMC, PlotROC.xls,
ROCFIT, LABROC1, CORROC2, CLABROC, INDROC, ROCPWRPC)
- Pan and Metz have developed a program, PROPROC, for "hooked"
data (i.e., the ROC curve crosses below the diagonal "chance
line") and/or degenerate data. They offer the following
definition (Pan X and Metz CE. The "proper" binormal model:
Parametric ROC curve estimation with degenerate data. Academic
Radiology 1997;4:380-389):
"ROC
data sets are said to be 'degenerate' when they can be
fit exactly by a conventional binormal ROC curve that
consists of only horizontal and vertical line segments.
Data degeneracy can be quite common in circumstances where
data sets include only a small number of cases and/or
the data are obtained on a discrete ordinal scale in which
the categories are few and poorly distributed...
In general, a degenerate
data set involves empty cells in the data matrix that
represents the outcome of an ROC experiment that employs
a discrete (e.g., five-category) confidence-rating scale,
and particular patterns of these empty data cells inherently
cause iterative maximization procedures...to fail to
converge... . Dorfman and Berbaum (see below) developed
an approach and a corresponding computer program, RSCORE4,
that is able to estimate ROC curves for degenerate data
sets. Their approach is still based on the conventional
binormal ROC model, but it eliminates degeneracy by
assigning small positive values to all empty cells.
... We have proposed an alternative approach to the
problem of data degeneracy that employs a "proper" binormal
model, and we have developed a corresponding computer
program, PROPROC, for maximum-likelihood estimation
of proper binormal ROC curves."
PROPROC is a parametric approach, developed for discrete
categorical data but also applicable to continuous data.
It is a zip file located in: ftp://minira.bsd.uchicago.edu/roc/ibmpc/
There is no "generic" version of the code. To obtain a UNIX-compatible
version, rewrite the file IOFILE.F so that it does not use
the WINDOWS API.
- ftp://perception.radiology.uiowa.edu/
===> Kevin Berbaum's ROC software site; download MRMC;
includes RSCORE for degenerate rating data (see "RSCORE
4.66 User's Manual.pdf" in this directory, or go to Berbaum's
site for further information); also includes BIGAMMA for
degenerate rating data (both RSCORE and BIGAMMA are parametric
and have the same data-input format; Berbaum recommends
trying RSCORE before BIGAMMA, and if the ROC curves don't
make sense or are difficult to interpret (e.g., they cross the
"chance line"), then try BIGAMMA) (also see PROPROC, above,
for degenerate data)
- http://www.radiology.arizona.edu/krupinski/mips/rocprog.html
===> Univ. of Arizona Radiology Department website; download
most of the Metz and Dorfman/Berbaum software discussed
previously; also download Richard Swensson's LROC, which
takes target location into account in analyzing ROC data
- Those interested in a Bayesian approach to ROC analysis are directed
to: http://www-math.bgsu.edu/~albert/ord_book/Chapter5/
This directory contains MATLAB software and an example ("Example
2") from Chapter 5 of Ordinal Data Modeling, by Johnson
and Albert, Springer (NY), 2001
- Andrew Zhou has developed software to deal with verification bias
in ROC analyses. Andrew's COMPROC program is not currently
available online, but his email address (as of June 2002)
for those interested in this software is: Andrew.Zhou@med.va.gov
Verification bias is a phenomenon that occurs when inclusion
in an ROC study depends on the availability of a gold standard
which is only obtained on a particular nonrandom subset
of patients in the study population (e.g., only women with
"suspicious" mammograms get biopsied, and biopsy is the
gold standard).
- Nancy Obuchowski of the Cleveland Clinic has developed FORTRAN
software to provide an ROC-type summary measure of accuracy
when the gold standard is ordinal or continuous, rather
than dichotomous. (See: Obuchowski NA. An ROC-Type Measure
of Accuracy When the Gold Standard is Rank- or Continuous-
Scale. Submitted for publication.)
Artificially dichotomizing ordinal or continuous gold standards
in order to fit them into a traditional ROC analysis leads
to bias and inconsistency in estimates of diagnostic accuracy.
Obuchowski provides nonparametric estimators of accuracy
which have interpretations analogous to the AUC, and permit
the use of ordinal (ROC_ORDGOLD.for) or continuous (ROC_CONTGS.for)
gold standards.
- Two ROC-analysis software packages that may be of particular interest to clinical chemists and persons in related fields are MedCalc
(www.MedCalc.be) and EP Evaluator (developed by David Rhoads Associates, www.dgrhoads.com). Both packages are available by subscription only (for a fee).
EP Evaluator has a particularly useful feature that lets the user move a cursor along the ROC curve and view sensitivity and specificity at each cutoff value.