ROC Analysis
ROC methodology is appropriate in situations where there are 2 possible
"truth states" (e.g., diseased/normal, event/non-event,
or some other binary outcome), "truth" is known for each
case, and "truth" is determined independently of the diagnostic
tests, predictor variables, etc. under study.
In this subdirectory, you will find a number of programs (mostly in
FORTRAN) used in ROC analysis. They are briefly described below,
along with general guidelines to help you decide which program is
most appropriate for your data. First, some basic terminology to
help you make your decision (note that "disease" can be
replaced with "condition" or "event"):
Rating data vs Continuous data
The term "rating data" is used to describe data based
on an ordinal scale. For example, it is common in radiology studies
to use a 5-point scale such as 1=disease definitely absent, 2=disease
probably absent, 3=disease possibly present, 4=disease probably
present, 5=disease definitely present. "Continuous data"
refers to either truly continuous measurements or "percent
confidence" scores (0-100).
Interpreting the Area Under the ROC Curve (AUC)
The area under the ROC curve (AUC) is commonly used as a summary
measure of diagnostic accuracy. It can take values from 0.0 to 1.0.
The AUC can be interpreted as the probability that a randomly selected
diseased case (or "event") will be regarded with greater
suspicion (in terms of its rating or continuous measurement) than
a randomly selected nondiseased case (or "non-event").
So, for example, in a study involving rating data, an AUC of 0.84
implies that there is an 84% probability that a randomly selected
diseased case will receive a more-suspicious (higher) rating than
a randomly selected nondiseased case. Note that an AUC of 0.50 means
that the diagnostic accuracy in question is equivalent to that which
would be obtained by flipping a coin (i.e., random chance). AUCs
less than 0.50 are possible but uncommon. It is often
informative to report a 95% confidence interval for a single AUC
in order to determine whether the lower endpoint exceeds 0.50 (i.e.,
whether the diagnostic accuracy in question is, with some certainty,
any better than random chance).
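The probability interpretation can be checked directly by comparing every diseased case with every nondiseased case. The following is a minimal Python sketch, not one of the programs in this subdirectory; the function name and the ratings are made up for illustration, and tied pairs are counted as one half (the usual nonparametric convention).

```python
from itertools import product

def auc_pairwise(diseased, nondiseased):
    """Fraction of (diseased, nondiseased) pairs in which the diseased
    case has the higher score; tied pairs count as one half."""
    total = 0.0
    for d, n in product(diseased, nondiseased):
        if d > n:
            total += 1.0
        elif d == n:
            total += 0.5
    return total / (len(diseased) * len(nondiseased))

# Hypothetical 5-point ratings (1 = definitely absent ... 5 = definitely present)
diseased_ratings = [5, 4, 4, 3, 5, 2]
nondiseased_ratings = [1, 2, 3, 1, 2, 4]

auc = auc_pairwise(diseased_ratings, nondiseased_ratings)
print(round(auc, 3))  # 30.5 favorable pairs out of 36 -> 0.847
```

With these made-up ratings, a randomly chosen diseased case outranks a randomly chosen nondiseased case about 85% of the time.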
Designing an ROC study: Which scale to use?
While ordinal (1-5) rating scales are probably the most widely used
in radiology studies, there are advantages to using "percent
confidence" (0-100) scales. (Of course, if you are dealing
with a continuous measurement, you don't have to worry about which
scale to use.) For continuous data, nonparametric methods are quite
reasonable. With rating data, parametric methods are recommended,
as nonparametric methods will be biased (i.e., tend to underestimate
the true AUC). The standard error of the estimated area under the
ROC curve is smaller using a continuous scale.
Parametric vs Nonparametric methodology
"Parametric" methodology refers to inference (maximum likelihood estimates, MLEs) based
on the binormal model (i.e., the estimate assumes
one normal distribution for cases with the disease and one normal
distribution for cases without, or that the data can be monotonically
transformed to normal). When this assumption is true, the MLE is
unbiased.
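Under the two-normal assumption, the area under the smooth curve has a closed form: AUC = Phi((mean_D - mean_N) / sqrt(sd_D^2 + sd_N^2)), where Phi is the standard normal CDF. The Python sketch below only evaluates that formula for given parameter values. It does not perform the maximum likelihood fitting done by the parametric programs, and the function names and numbers are illustrative assumptions.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binormal_auc(mean_d, sd_d, mean_n, sd_n):
    """AUC implied by one normal distribution for diseased cases and one
    for nondiseased cases (higher values = more suspicious)."""
    return normal_cdf((mean_d - mean_n) / math.sqrt(sd_d ** 2 + sd_n ** 2))

# Hypothetical parameter values, for illustration only
print(binormal_auc(mean_d=1.5, sd_d=1.0, mean_n=0.0, sd_n=1.0))
print(binormal_auc(mean_d=0.0, sd_d=1.0, mean_n=0.0, sd_n=1.0))  # identical distributions -> 0.5
```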
"Nonparametric" refers to inference based on the trapezoidal rule (which is equal
to the Wilcoxon estimate of the area under the ROC curve, which
in turn is equal to the "c"-statistic in SAS PROC LOGISTIC
output). Nonparametric estimates of the area under the ROC curve
(AUC) tend to underestimate the "smooth curve" area (i.e.,
parametric estimates), but this bias is negligible for continuous
data.
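The equality between the trapezoidal area and the Wilcoxon estimate can be verified numerically. The Python sketch below is an illustration under stated assumptions: the ratings are hypothetical, and a case is called positive when its rating is at or above each observed threshold.

```python
def empirical_roc(diseased, nondiseased):
    """ROC points (FPR, TPR), one per observed threshold, from the
    strictest cut-off to the most lenient, starting at (0, 0)."""
    thresholds = sorted(set(diseased) | set(nondiseased), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(d >= t for d in diseased) / len(diseased)
        fpr = sum(n >= t for n in nondiseased) / len(nondiseased)
        points.append((fpr, tpr))
    return points

def trapezoidal_auc(points):
    """Area under straight lines joining consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def wilcoxon_auc(diseased, nondiseased):
    """Proportion of diseased/nondiseased pairs ranked correctly (ties = 1/2)."""
    wins = sum((d > n) + 0.5 * (d == n) for d in diseased for n in nondiseased)
    return wins / (len(diseased) * len(nondiseased))

# Hypothetical 5-point ratings
diseased_ratings = [5, 4, 4, 3, 5, 2]
nondiseased_ratings = [1, 2, 3, 1, 2, 4]

roc_points = empirical_roc(diseased_ratings, nondiseased_ratings)
print(trapezoidal_auc(roc_points))
print(wilcoxon_auc(diseased_ratings, nondiseased_ratings))
```

Both numbers agree (up to floating-point rounding). With only five rating categories the curve is made of a few straight segments, which is why the trapezoidal area falls below a smooth fitted curve through the same points.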
Recommendations
For rating data, try a parametric method first. The bias inherent
in the nonparametric method might be problematic. If the data are
sparse (i.e., nondiseased patients and diseased patients tend to
be rated at opposite ends of the scale), then parametric methods
may not work well. Using the nonparametric approach is an option
in these cases, but may provide even more biased results than it
normally would.
For continuous data, either the parametric or nonparametric approach
is fine.
Correlated data
"Correlated data" refers to multiple observations obtained
from the same "region of interest" (ROI). For example,
in a study of appendicitis screening, each patient may be imaged
by two different "modalities" (e.g., plain X-ray film
versus digitized images). There will then be 2 different images
(plain film vs digitized image) of Patient X's appendix, and each
image will be assigned a separate rating. Therefore, a single ROI
(Patient X's appendix) yields 2 observations (1 from each imaging
modality). When comparing the accuracy (AUC) of plain film to that
of digitized imaging, we must take into account the fact that the
2 AUCs are correlated because they are based on the same sample
of cases.
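One way to account for this correlation nonparametrically is the approach of DeLong et al. (Biometrics, 1988). The Python sketch below follows that idea but is not a transcription of DELONG.FOR: the function names and ratings are hypothetical, cases are assumed to be paired by list position across the two modalities, and clustering is assumed absent.

```python
import math

def psi(x, y):
    """1 if the diseased score is higher, 1/2 if tied, 0 otherwise."""
    return 1.0 if x > y else 0.5 if x == y else 0.0

def components(diseased, nondiseased):
    """Per-case placement values for diseased (v10) and nondiseased (v01) cases."""
    v10 = [sum(psi(x, y) for y in nondiseased) / len(nondiseased) for x in diseased]
    v01 = [sum(psi(x, y) for x in diseased) / len(diseased) for y in nondiseased]
    return v10, v01

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

def compare_correlated_aucs(dis_a, non_a, dis_b, non_b):
    """Compare AUCs of modalities A and B read on the same cases.
    dis_a[i] and dis_b[i] must refer to the same diseased case
    (likewise non_a[j] and non_b[j])."""
    m, n = len(dis_a), len(non_a)
    v10_a, v01_a = components(dis_a, non_a)
    v10_b, v01_b = components(dis_b, non_b)
    auc_a, auc_b = sum(v10_a) / m, sum(v10_b) / m
    var_a = cov(v10_a, v10_a) / m + cov(v01_a, v01_a) / n
    var_b = cov(v10_b, v10_b) / m + cov(v01_b, v01_b) / n
    cov_ab = cov(v10_a, v10_b) / m + cov(v01_a, v01_b) / n
    var_diff = var_a + var_b - 2.0 * cov_ab  # the covariance term carries the pairing
    z = (auc_a - auc_b) / math.sqrt(var_diff) if var_diff > 0 else float("nan")
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    return {"auc_a": auc_a, "auc_b": auc_b, "var_diff": var_diff, "z": z, "p": p}

# Hypothetical ratings: the same 6 diseased and 6 nondiseased patients,
# each imaged by plain film (A) and by digitized image (B)
film_dis, film_non = [5, 4, 4, 3, 5, 2], [1, 2, 3, 1, 2, 4]
digi_dis, digi_non = [4, 3, 5, 2, 4, 3], [2, 2, 3, 1, 3, 4]
result = compare_correlated_aucs(film_dis, film_non, digi_dis, digi_non)
print(result)
```

Ignoring the covariance term (i.e., treating the two AUCs as independent) would generally give the wrong standard error for the difference.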
Clustered data
"Clustered data" refers to situations in which (one or
more) patients have two or more "regions of interest"
(ROIs), each of which contributes a separate measurement. For example,
in a brain study, measurements may be obtained from the left and
right hemispheres in each patient. In a mammography study, each
breast image may be subdivided into 5 ROIs, and thus 10 separate
ratings may be obtained for each patient. In such cases, it is important
to account for intrapatient correlation between measurements obtained
on ROIs within the same patient.
Of course, it is possible to have data that are both clustered and correlated.
For example, in the mammography example mentioned above, if each
breast is imaged in 2 modalities, then there are 10 ROIs per patient,
and 2 ratings per ROI. A comparison of the accuracies (AUCs) between
the two modalities will have to take into account (a) the fact that
the 10 ratings from each patient (for each modality) are correlated,
and (b) the fact that the 2 ratings from the 2 modalities (for each
ROI) are correlated.
One reader versus multiple readers
The preceding discussion assumes either (a) that the underlying
variable is a measurement or rating obtained from only a single
source (e.g., one reader) or (b) that AUCs will only be reported
*separately* for each reader. If it is desired that the AUCs of
multiple readers (e.g., 3 doctors independently assigning ratings)
be averaged in order to arrive at one overall "average AUC"
per modality, then special methods must be used to handle inter-rater
correlation (correlation between the ratings of different readers
due to the fact that the same set of cases is rated by each reader).
Partial ROC area
In some cases, rather than looking at the area under the entire
ROC curve, it is more helpful to look at the area under only a portion
of the curve, for example, within a certain range of false-positive
(FP) rates (i.e., restricted to a portion of the X-axis). Often,
interest does not lie in the entire range of FP rates, and consequently,
only part of the area under the curve is relevant. For example,
if we know ahead of time that a particular diagnostic test would
not be useful if its FP rate is greater than 0.25, we might want
to restrict our attention to that portion of the ROC curve where
FP rates are less than or equal to 0.25.
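As an illustration of restricting attention to FP rates of 0.25 or less, the Python sketch below computes a trapezoidal partial area from a set of ROC operating points, interpolating linearly at the cut-off. The operating points are hypothetical, and this is a simple nonparametric sketch rather than the parametric approach used by PARTAREA.FOR.

```python
def partial_auc(points, max_fpr):
    """Trapezoidal area under the ROC points for FP rates in [0, max_fpr].
    `points` must be (FPR, TPR) pairs sorted by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment at max_fpr using linear interpolation
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical operating points (FPR, TPR)
roc_points = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

pauc = partial_auc(roc_points, max_fpr=0.25)
print(round(pauc, 6))  # 0.116875
```

When reading such a number, note that the largest possible partial area over this range is 0.25 (a perfect test), while the chance diagonal gives 0.25 * 0.25 / 2 = 0.03125.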
Another possible reason to analyze the partial AUC rather than the entire
AUC was discussed by Dwyer (Radiology 1997;202:621-625):
"A major drawback to the area below the entire ROC plot as an index
of performance is its global nature. Its summarization of the
entire ROC plot fails to consider the plot as a composite of different
segments with different diagnostic implications. ROC plots that
cross may have similar total areas but differ in their diagnostic
efficacy in specific diagnostic situations. Prominent differences
between ROC plots in specific regions may be muted or reversed
when the total area is considered. Moreover, plots with different
total areas may be similar in specific regions...A solution to
the problems of a global assessment of the entire ROC plot is
the assessment of specific regions...on the ROC plot."
Which program should you use?
- For ROC sample size calculations:
ROCPOWER.SAS (1 reader; 1 or 2 ROC curves)
(for help, look at ROCPOWER_HELP.TXT and/or ROCPOWER_DOC.WPD)
DESIGNROC.FOR
(for partial ROC area or SENS at fixed FPR)
(for help, look at DESIGNROC_HELP.TXT)
MULTIREADER_POWER.SAS
(if multiple readers)
(for help, look at MULTIREADER_HELP.TXT)
- For plotting ROC curves using S-PLUS:
rocPlot.s (written by Hemant Ishwaran, PhD, Cleveland Clinic)
(for this program, please refer to: http://www.bio.ri.ccf.org/Resume/Pages/Ishwaran/rocPlot.s)
For plotting ROC curves using SAS/Graph:
CREATE_ROC.SAS (for help, look at CREATEROC_HELP.TXT)
- For inference on *partial* ROC area:
PARTAREA.FOR (for help, look at PARTAREA_HELP.TXT)
PARTAREA.FOR uses a parametric approach to estimate partial AUC. Margaret
Pepe has developed Stata software to implement a nonparametric
method of estimating partial (or full) AUC. This program can
be obtained from:
http://www.fhcrc.org/labs/pepe/book/
(Click on the Programs link and download: aucbs.ado and aucbs.hlp)
Reference: Dodd LE, Pepe MS. Partial AUC estimation and regression.
Biometrics 2003;59(3):614-623.
- For inference on the area(s) under one or more ROC curves:
(SINGLE READER; 1-2 "MODALITIES")
(A) Is data clustered?
  If Y ===> Correlated data?
    Y ===> CLUSTERBI.FOR (for help, look at CLUSTERBI_HELP.TXT)
    N ===> CLUSTER.FOR (for help, look at CLUSTER_HELP.TXT)
  If N ===> If rating data, skip to (B)
            If continuous data, skip to (C)
(B) (For rating data) Would you prefer a parametric or nonparametric
method?
  If parametric: Is data correlated?
    Y ===> CORROC2.F
    N ===> ROCFIT.F (for 1 curve) or INDROC.F (for 2 curves)
  If nonparametric: Is data correlated?
    Y ===> DELONG.FOR
    N ===> DELONG.FOR
  NOTE: I have also included in this subdirectory a SAS program,
  ROC_MASTERPIECE.SAS, which computes nonparametric estimates
  of ROC area for 2 or more (possibly) correlated modalities
  and performs a global chi-square test comparing all the
  AUCs.
(C) (For continuous data) Would you prefer a parametric or nonparametric
method?
  If parametric: Is data correlated?
    Y ===> CLABROC
    N ===> LABROC4
  If nonparametric: Is data correlated?
    Y ===> DELONG.FOR
    N ===> DELONG.FOR
  NOTE: I have also included in this subdirectory a SAS program,
  ROC_MASTERPIECE.SAS, which computes nonparametric estimates
  of ROC area for 2 or more (possibly) correlated modalities
  and performs a global chi-square test comparing all the
  AUCs.
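For readers who prefer to see the single-reader decision path written out, here is a small Python sketch of it. The function `choose_program` and its argument names are hypothetical conveniences (nothing like it ships in this subdirectory); the returned program names are exactly those listed in the guide above.

```python
def choose_program(clustered, correlated, data_type=None,
                   parametric=None, n_curves=1):
    """Return the program suggested by the single-reader decision guide.
    data_type is "rating" or "continuous"; it is ignored for clustered data."""
    if clustered:
        return "CLUSTERBI.FOR" if correlated else "CLUSTER.FOR"
    if data_type not in ("rating", "continuous"):
        raise ValueError('data_type must be "rating" or "continuous"')
    if not parametric:
        # Nonparametric branch: the same program whether or not data are correlated
        return "DELONG.FOR"
    if data_type == "rating":
        if correlated:
            return "CORROC2.F"
        return "ROCFIT.F" if n_curves == 1 else "INDROC.F"
    return "CLABROC" if correlated else "LABROC4"

# Example: rating data, not clustered, two modalities read on the same cases
print(choose_program(clustered=False, correlated=True,
                     data_type="rating", parametric=True))  # CORROC2.F
```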
- For inference on the area(s) under one or more ROC curves:
(MULTIPLE READERS and/or OTHER COVARIATES)
(for rating or continuous data) ===> OBUMRM.FOR
(for help, see INSTRUCTIONS.TXT; also see 3 sets of example
input and output files: FORMACONT.DAT, FORMACONT_OUT.DAT, FORMAORDIN.DAT,
FORMAORDIN_OUT.DAT, FORMB.DAT, and FORMB_OUT.DAT; also on the
Web at OBUMRM.html)
This is a FORTRAN program written by Nancy Obuchowski which
implements the Obuchowski-Rockette method for analyzing multireader
and multimodality ROC data. This program does not handle clustered
data (i.e., multiple observations per patient). The program
produces nonparametric estimates of ROC area, but allows for
the comparison of user-input parametric AUCs, partial AUCs,
or sensitivities at a fixed FPR (see "Format B" for examples).
(for rating or continuous data) ===> MULTIVARIATEROC.S
This is an S-PLUS program written by Hemant Ishwaran which uses
the nonparametric method of DeLong et al. (Biometrics, 1988)
and allows for multiple observations per patient (e.g., ratings
from multiple readers). It does not allow for multiple readers
*and* multiple modalities. This program does not handle missing
data. This program does not come up with an overall average
ROC area among readers; it does provide a global chi-square
test comparing the ROC areas of the readers and allows for linear
contrasts of selected readers (e.g., '1, 2, and 3 vs 4 and 5';
'4 vs 5'). To access this program, go to: multivariateRoc.s
(for rating or continuous data) ===> LABMRMC
This program uses the Dorfman-Berbaum-Metz (DBM) algorithm to
compare multiple readers and multiple treatments. The program
uses jackknifing and ANOVA techniques to test the statistical
significance of the differences between treatments and between
readers. LABMRMC can be downloaded from the website of the Kurt
Rossmann Laboratories at the Univ. of Chicago at: http://xray.bsd.uchicago.edu/krl/roc_soft.htm
LABMRMC is suitable for use with subjective probability-rating
scales (0-100); outcomes for each reader-treatment combination
are categorized in an attempt to produce an appropriate spread
of operating points on each ROC curve, followed by application
of the DBM procedure. (Note that all of Dr. Metz's ROC software
is available in three different formats: IBM-compatible, Mac-compatible,
and generic text-only source code for UNIX/VMS environments.)
(Documentation for LABMRMC can be obtained from Dr. Metz's website
or in this directory in the file LABMRMC.DOC)
(for rating data) ===> MRMC
The MRMC (Dorfman, Berbaum, Abu-Dagga, and Schartz, ftp://perception.radiology.uiowa.edu/mrmc32/)
software implements the procedure for discrete rating data (up
to 20 rating categories) as described in Dorfman, Berbaum, and
Metz (1992). This program uses the Dorfman-Berbaum-Metz algorithm
to compare multiple readers and multiple treatments.
LINKS TO ADDITIONAL ROC SOFTWARE (degenerate data; Bayesian ROC methods;
LROC analysis; verification bias; etc.)
- http://xray.bsd.uchicago.edu/krl/roc_soft.htm
===> the ROC site of the Kurt Rossmann Laboratories at the Univ.
of Chicago; download ROCKIT, LABMRMC, PlotROC.xls, ROCFIT, LABROC1,
CORROC2, CLABROC, INDROC, ROCPWRPC
- Pan and Metz have developed a program, PROPROC, for "hooked" data
(i.e., the ROC curve crosses below the diagonal "chance line")
and/or degenerate data. They offer the following definition (Pan
X and Metz CE. The "proper" binormal model: Parametric ROC curve
estimation with degenerate data. Academic Radiology 1997;4:380-389):
"ROC data sets are said to be 'degenerate' when they can be fit exactly
by a conventional binormal ROC curve that consists of only horizontal
and vertical line segments. Data degeneracy can be quite common
in circumstances where data sets include only a small number
of cases and/or the data are obtained on a discrete ordinal
scale in which the categories are few and poorly distributed...
In general, a degenerate data set involves empty cells in
the data matrix that represents the outcome of an ROC experiment
that employs a discrete (e.g., five-category) confidence-rating
scale, and particular patterns of these empty data cells inherently
cause iterative maximization procedures...to fail to converge...
Dorfman and Berbaum (see below) developed an approach and
a corresponding computer program, RSCORE4, that is able to
estimate ROC curves for degenerate data sets. Their approach
is still based on the conventional binormal ROC model, but
it eliminates degeneracy by assigning small positive values
to all empty cells. ... We have proposed an alternative approach
to the problem of data degeneracy that employs a "proper"
binormal model, and we have developed a corresponding computer
program, PROPROC, for maximum-likelihood estimation of proper
binormal ROC curves."
PROPROC is a parametric approach, developed for discrete categorical
data but also applicable to continuous data. It is a zip file
located in: ftp://minira.bsd.uchicago.edu/roc/ibmpc/ There is
no "generic" version of the code. To obtain a UNIX-compatible
version, rewrite the file IOFILE.F so that it does not use the
WINDOWS API.
- ftp://perception.radiology.uiowa.edu/
===> Kevin Berbaum's ROC software site; download MRMC; includes
RSCORE for degenerate rating data (see "RSCORE 4.66 User's Manual.pdf"
in this directory, or go to Berbaum's site for further information);
also includes BIGAMMA for degenerate rating data (both RSCORE
and BIGAMMA are parametric and have the same data-input format;
Berbaum recommends trying RSCORE before BIGAMMA, and if the ROC
curves don't make sense or are difficult to interpret (e.g., they cross
the "chance line"), then try BIGAMMA) (also see PROPROC, above,
for degenerate data)
- http://www.radiology.arizona.edu/krupinski/mips/rocprog.html
===> Univ. of Arizona Radiology Department website; download
most of the Metz and Dorfman/Berbaum software discussed previously;
also download Richard Swensson's LROC, which takes target location
into account in analyzing ROC data
- Those interested in a Bayesian approach to ROC analysis are directed
to: http://www-math.bgsu.edu/~albert/ord_book/Chapter5/ This directory
contains MATLAB software and an example ("Example 2") from Chapter
5 of Ordinal Data Modeling, by Johnson and Albert, Springer (NY),
2001.
- Andrew Zhou has developed software to deal with verification bias in
ROC analyses. Andrew's COMPROC program is not currently available
online, but his email address (as of June 2002) for those interested
in this software is: Andrew.Zhou@med.va.gov
Verification bias is a phenomenon that occurs when inclusion in
an ROC study depends on the availability of a gold standard which
is only obtained on a particular nonrandom subset of patients
in the study population (e.g., only women with "suspicious" mammograms
get biopsied, and biopsy is the gold standard).
- Nancy Obuchowski of the Cleveland Clinic has developed FORTRAN software
to provide an ROC-type summary measure of accuracy when the gold
standard is ordinal or continuous, rather than dichotomous. (See:
Obuchowski NA. An ROC-Type Measure of Accuracy When the Gold Standard
is Rank- or Continuous- Scale. Submitted for publication.)
Artificially dichotomizing ordinal or continuous gold standards
in order to fit them into a traditional ROC analysis leads to
bias and inconsistency in estimates of diagnostic accuracy. Obuchowski
provides nonparametric estimators of accuracy which have interpretations
analogous to the AUC, and permit the use of ordinal (ROC_ORDGOLD.for)
or continuous (ROC_CONTGS.for) gold standards.