Title: ROC or FROC? It depends on the research question
Medical Physics, Volume 44, Issue 5, pp. 1603-1606. Point/Counterpoint. Free Access.

Stephen L. Hillis, Ph.D. ([email protected]), Departments of Radiology and Biostatistics, The University of Iowa, Iowa City, Iowa 52242-1077, USA
Dev P. Chakraborty, Ph.D. ([email protected]), ExpertCAD Analytics, LLC, 2103 Noble Court, Murrysville, Pennsylvania 15668, USA
Colin G. Orton, Ph.D., Moderator

First published: 07 February 2017. https://doi.org/10.1002/mp.12151

Suggestions for topics suitable for these Point/Counterpoint debates should be addressed to Colin G. Orton, Professor Emeritus, Wayne State University, Detroit: [email protected]. Persons participating in Point/Counterpoint discussions are selected for their knowledge and communicative skill. Their positions for or against a proposition may or may not reflect their personal opinions or the positions of their employers.
Overview

Receiver Operating Characteristic (ROC) and Free-Response Operating Characteristic (FROC) methods are used to assess the accuracy of radiological imaging systems. ROC methods analyze an observer's confidence that an abnormality is or is not present, whereas FROC methods additionally require the observer to locate abnormalities. Typically, ROC and FROC methods are applied to answer different research questions: sometimes ROC is the most appropriate and sometimes FROC. However, some believe that ROC is usually either equivalent or inferior and that FROC is preferred over ROC for all research questions. This is the topic debated in this month's Point/Counterpoint.

Arguing for the Proposition is Stephen L. Hillis, Ph.D. Dr. Hillis is a research professor in the Departments of Radiology and Biostatistics at the University of Iowa. He earned a Ph.D. in statistics in 1987 and an MFA in music in 1978, both from the University of Iowa. Since 1999, when he first began working with Don Dorfman, Professor of Radiology and Psychology, his research at the University of Iowa has focused on methodology for multi-reader diagnostic radiologic imaging studies. He is the author of 90 articles from many diverse fields, many written when he was Director of the University of Iowa Statistical Consulting Center and Senior Statistician at the Iowa City VA Health Care System.

Arguing against the Proposition is Dev P. Chakraborty, Ph.D. Dr. Chakraborty earned his Ph.D.
in solid-state physics from the University of Rochester, New York, in 1977 and, in 1979, began his career in medical physics working with Ivan Brezovich in the Department of Radiology, University of Alabama at Birmingham, AL, where he worked until 1988 before moving to the Department of Radiology, University of Pennsylvania, Philadelphia. He subsequently moved to the University of Pittsburgh, Pittsburgh, PA, in 1997, where he was Professor in the Department of Bioengineering before assuming his current position at ExpertCAD Analytics, LLC, in 2016. He has published over 75 papers in peer-reviewed journals, many in the field of observer performance analysis.

For the proposition: Stephen L. Hillis, Ph.D.

Opening Statement

When comparing imaging modalities in a diagnostic radiologic observer study, what type of data should a researcher collect? Receiver operating characteristic (ROC)1,2 data consist of likelihood-of-disease ratings, one for each case (i.e., patient); free-response ROC (FROC)3,4 data consist of localizations (i.e., specifications of location) of suspected diseased areas (e.g., malignant tumors), referred to as targets, together with target-specific likelihood-of-disease ratings; and localization ROC (LROC)5-7 data consist of both types of data. I argue that the appropriate data type and corresponding analysis combination is the one that best answers the research question.

In many medical centers, cases recalled at screening mammography will undergo extensive further evaluation, making target location of minor importance.8 Thus, for these centers a researcher might ask, "Which modality is best for classification of cases as diseased versus nondiseased?" Here, an ROC approach is appropriate. An ROC curve answers the question, "If a reader incorrectly classifies X% (e.g., X = 10) of nondiseased cases, what percent of diseased patients does the reader classify correctly?"
The ROC area under the curve (AUC) estimates the probability that a reader will correctly classify a randomly chosen pair of diseased and nondiseased cases. In contrast, for diagnostic mammography for patients with suspicious screening mammograms, accurate localization of all actual targets is necessary to ensure appropriate treatment. Thus, a researcher might ask, "Which modality is best for classification when accurate localization of targets is needed?" Here, a modified LROC approach that requires a reader to accurately localize all of a patient's actual targets is appropriate. The resulting LROC curve answers the question, "If a reader incorrectly classifies X% of nondiseased cases, for what percent of diseased patients does this reader jointly provide correct classification and accurate target localization?" The LROC AUC estimates the probability that a reader will correctly classify a randomly chosen diseased/nondiseased pair of cases and provide accurate target localization.

Now consider the frequently used alternative FROC (AFROC)3 analysis. An estimated AFROC curve answers the question, "If a reader incorrectly classifies X% of nondiseased cases, what percent of actual targets (across patients) will the reader accurately localize?" This does not answer either of the two previous research questions. The corresponding jackknife AFROC (JAFROC)4 summary statistic estimates the probability that a randomly selected actual target will be rated higher than the maximum rating given to a randomly selected normal image, which is not clinically relevant to the previous two research questions. On the other hand, AFROC seems more suitable than ROC or LROC for assessing the performance of a computer algorithm used in computer-aided diagnosis (CAD) to suggest sites for a human reader to examine.

In conclusion, different approaches estimate different aspects of reader performance and hence answer different research questions.
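The two summary statistics contrasted above can be made concrete by computing them directly from their probabilistic definitions. The following is an illustrative sketch (not from either author; all ratings are made-up data):

```python
# Illustrative sketch: empirical versions of the two summary statistics
# defined above, computed as proportions over all relevant pairs.

def roc_auc(diseased, nondiseased):
    """Empirical ROC AUC: probability that a randomly chosen diseased case is
    rated higher than a randomly chosen nondiseased case (ties score 1/2)."""
    pairs = [(d, n) for d in diseased for n in nondiseased]
    wins = sum(1.0 if d > n else 0.5 if d == n else 0.0 for d, n in pairs)
    return wins / len(pairs)

def jafroc_fom(lesion_ratings, normal_cases):
    """Empirical JAFROC figure of merit: probability that a randomly chosen
    actual target's rating exceeds the maximum rating on a randomly chosen
    normal image (an unmarked normal image ranks below every mark)."""
    maxima = [max(case, default=float("-inf")) for case in normal_cases]
    pairs = [(lr, m) for lr in lesion_ratings for m in maxima]
    wins = sum(1.0 if lr > m else 0.5 if lr == m else 0.0 for lr, m in pairs)
    return wins / len(pairs)

# ROC paradigm: one overall likelihood-of-disease rating per case.
print(roc_auc([0.9, 0.7, 0.6, 0.8], [0.3, 0.5, 0.2, 0.4]))        # 1.0

# FROC paradigm: ratings of marks on actual targets, plus the (possibly
# empty) list of mark ratings on each normal image.
print(jafroc_fom([0.9, 0.4, 0.8, 0.7], [[0.3], [], [0.5, 0.2]]))  # 11/12
```

The sketch makes the unit-of-analysis difference visible: the ROC AUC averages over case pairs, while the JAFROC statistic averages over target/normal-image pairs.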
Furthermore, a researcher may want to estimate two or more aspects of reader performance. In the words of Charles E. Metz:1 "How effective is a particular diagnostic imaging procedure? … To address the question in a meaningful way, we must decide exactly what information is sought, and in answering we must state precisely what information we are giving."

Against the proposition: Dev P. Chakraborty, Ph.D.

Opening Statement

The FROC-paradigm radiologist marks and rates suspicious regions. Based on a proximity criterion, marks close to lesions are credited as correct localizations. ROC is a subset of FROC: if the proximity criterion is large enough, and the radiologist knows it, the two paradigms are indistinguishable; specifically, the radiologist will make at most one mark per case, and unmarked cases are "definite normals". A FROC model predicts ROC curves,9 but one cannot go the other way. For interstitial lung disease, where location is implicit, ROC is appropriate, but then so is FROC. However, in clinical tasks that involve finding focal disease, for example, screening mammography or lung nodules, if the radiologist suspects the patient is diseased, there is at least one associated suspicious location. For these tasks the ROC paradigm obtains a rating that there is disease "somewhere", which raises the question: if disease is "somewhere", why not point to it (Prof. Gary Barnes, private communication ca. 1985)? In fact radiologists do: they mark and annotate suspicious regions, but the ROC paradigm ignores this information, leading to loss of statistical power relative to FROC.4 It is unethical to use a method with lower statistical power when one with greater power is available.10 Over 104 publications, mostly non-US, have used JAFROC to analyze FROC studies. Clinicians have long recognized the importance of accounting for localization,11 and a leading statistician12 has recognized it as well. Yet there is opposition to JAFROC within the US.
Here is a recent reviewer comment: "…the JAFROC statistic… does not yield a meaningful clinical interpretation". I ask medical physicists: if the probability that lesions are rated higher than nondiseased cases (the JAFROC statistic) equals unity, is this a good thing? I hope your answer is a resounding "yes", because a unit value means all diseased patients are correctly recalled and no nondiseased patients are incorrectly recalled. (A zero value reverses correct/incorrect in the preceding sentence.) This is the clinical interpretation the reviewer finds so elusive.

The subject of this debate is not "rocket science", but it does require one to be open-minded and unbiased. Dirac13 addressed an analogous criticism leveled against quantum mechanics at the time, namely, that it did not provide a "satisfying picture" (translate "satisfying picture" to the reviewer's "meaningful clinical interpretation") as did classical mechanics. To paraphrase Dirac, the purpose of science is not to provide satisfying "pictures" but to explain phenomena. Since it allows zero or more mark-rating pairs per image, FROC is inherently more complex than ROC, no doubt about it. More importantly, it mirrors clinical practice, which is also more complex than the "somewhere" that the ROC paradigm accommodates. It is about time to stop using "I don't understand" as an excuse for impeding scientific progress and patient care. ROC methods should not be used to analyze localization tasks.

Rebuttal: Stephen L. Hillis, Ph.D.

An ROC analysis performed using FROC data by treating the highest target rating per case as the decision variable has been called an inferred ROC analysis.14 This is different from a conventional ROC analysis of ROC data, where the decision variable is a case-specific overall likelihood-of-disease rating.
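The inferred-ROC construction can be sketched in a few lines (my own illustrative example, not the authors' code; the 0.0 rating assigned to unmarked cases is an assumption):

```python
# Illustrative sketch: the inferred-ROC decision variable is the highest
# mark rating on each case; an unmarked case is assigned a rating (here
# 0.0) below every possible mark rating.

def inferred_roc_rating(mark_ratings):
    """Collapse a case's FROC mark ratings to a single ROC-style rating."""
    return max(mark_ratings, default=0.0)

# Each inner list holds one case's mark ratings (empty = unmarked case).
cases = [[0.8, 0.8], [0.8], [], [0.3, 0.6]]
print([inferred_roc_rating(c) for c in cases])  # [0.8, 0.8, 0.0, 0.6]
```

Note that the first two cases collapse to the same inferred rating even though one carries two marks rated 0.8 and the other only one; this information loss is exactly the intuitive objection raised against the equal-rankings assumption.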
Although the two analyses are equivalent under the assumption that the two decision variables provide the same case rankings, this assumption has never been conclusively demonstrated empirically.15 Furthermore, the assumption is intuitively not reasonable; for example, it implies that a reader considers two cases to have equal likelihood of disease if one case receives one mark and the other receives multiple marks, with each marked site rated as having 80% disease probability. Thus, inferred ROC answers a different question than conventional ROC. Although I disagree with Dr. Chakraborty's statement that "ROC is a subset of FROC," I have no problem agreeing that inferred ROC is a subset of FROC. Dr. Chakraborty has demonstrated through simulations that FROC (specifically, JAFROC) is more powerful than inferred ROC (conventional ROC is not included in these simulations). The problem with this kind of comparison is that JAFROC and inferred ROC test different hypotheses, and hence answer different research questions.

In conclusion, I agree that FROC data are required for assessing reader performance with respect to localization. However, this does not mean that one FROC analysis method (e.g., JAFROC) is suitable for all or most research questions. JAFROC is a statistically viable approach, but it should only be used if it answers the research question. A future area for research is the development of FROC-ROC analysis methods designed to answer specific research questions. Before new analysis methods can be developed, however, important research questions must be identified and stated precisely.

Rebuttal: Dev P. Chakraborty, Ph.D.

Let me address some of the statements made by my distinguished colleague, which I dispute. Location-specific methodologies (LROC/FROC/ROI) were developed not to address some abstract research question, but to better account for clinical reality. In my opinion, Dr.
Hillis has the roles of screening and diagnostic mammography reversed. While screening mammography results in a binary decision (recall: yes or no), radiologists also report the locations of suspicious regions. Diagnostic mammograms are used to further investigate these suspicious regions.16 There is an analogous difference between CADe (screening) and CADx (diagnostic).16 The starting point for diagnostics is not that there are suspicious regions "somewhere in the breast", but that there are specific regions found at screening. Yet Dr. Hillis supports using ROC methodology, where location is ignored, to analyze screening mammography. As justification, he repeats an incorrect argument8 that, because recalled cases will undergo "extensive evaluation", location is of "minor importance". The fallacy is that at the end of the extensive evaluation one is down to a few localized regions, whose truth is established by needle biopsy; that is, one is down to FROC data.

The issue is clinical, not statistical. Lesions are location-level manifestations of patient-level disease. A malignant lesion means the patient (not the lesion) has breast cancer. The lesion is not recalled; the patient is. If a lesion is rated higher than a nondiseased case, the patient to whom it is attached is also effectively being rated higher, as accounted for in the weighted AFROC (wAFROC) figure of merit, which is a case-level figure of merit in which every diseased case effectively contributes exactly one lesion.17 If the case is regarded as a random factor, results extrapolate to the population of cases. A "population of lesions", as Dr. Hillis' second-to-last paragraph seems to imply, is a contradiction in terms, as lesions have no independent existence. If an incorrect location is identified in a diseased case, then the recall is technically "correct" but for the wrong reason;18 two canceling errors actually occurred, a missed lesion and a location-level false positive.
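The case-level bookkeeping behind the wAFROC figure of merit can be sketched as follows (illustrative Python; the data and the equal-weight choice are my assumptions, not from the article):

```python
# Illustrative sketch of a wAFROC-style figure of merit. Lesion weights
# within each diseased case are taken equal and sum to 1, so every
# diseased case contributes exactly one "lesion-worth" of credit.

def wafroc_fom(diseased_cases, normal_cases):
    """diseased_cases: per-case lists of lesion-mark ratings.
    normal_cases: per-case lists of mark ratings (empty = unmarked)."""
    total = 0.0
    for normal in normal_cases:
        fp = max(normal, default=float("-inf"))  # highest mark on this normal case
        for lesions in diseased_cases:
            w = 1.0 / len(lesions)               # equal weights, summing to 1
            for r in lesions:
                total += w * (1.0 if r > fp else 0.5 if r == fp else 0.0)
    return total / (len(normal_cases) * len(diseased_cases))

# One single-lesion case, one two-lesion case; two normal cases.
print(wafroc_fom([[0.9], [0.7, 0.2]], [[0.3], []]))  # 0.875
```

Because the weights sum to one per case, the two-lesion case carries no more influence than the single-lesion case, which is the sense in which the statistic remains case-level.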
A modality that minimizes "right for the wrong reason" outcomes would have an advantage when analyzed by the wAFROC-AUC figure of merit3 but not when analyzed by ROC-AUC, because in ROC the two canceling errors count as a perfect decision.

Acknowledgments

Dr. Hillis thanks Craig Abbey (U.C. Santa Barbara), Kevin Berbaum (U. Iowa), Patrick Brennan (U. Sydney), Leonel Vasquez (U. Iowa), and Tamara Miner Haygood (U. Texas) for their helpful discussions. This research was supported by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under Award Number R01EB019967.

Conflicts of interest

The authors have no relevant conflicts of interest to disclose.

References

1. Metz CE. ROC methodology in radiologic imaging. Invest Radiol. 1986;21:720-733.
2. Pepe M. The Statistical Evaluation of Medical Tests for Classification and Prediction. New York, NY: Oxford University Press; 2003.
3. Chakraborty DP, Winter LHL. Free-response methodology: alternate analysis and a new observer-performance experiment. Radiology. 1990;174:873-881.
4. Chakraborty DP, Berbaum KS. Observer studies involving detection and localization: modeling, analysis and validation. Med Phys. 2004;31:2313-2330.
5. Starr SJ, Metz CE, Lusted LB, Goodenough DJ. Visual detection and localization of radiographic images. Radiology. 1975;116:533-538.
6. Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Med Phys. 1996;23:1709-1725.
7. Popescu LM. Nonparametric ROC and LROC analysis. Med Phys. 2007;34:1556-1564.
8. Gur D, Rockette HE. Performance assessments of diagnostic systems under the FROC paradigm: experimental, analytical, and results interpretation issues. Acad Radiol. 2008;15:1312-1315.
9. Chakraborty DP. ROC curves predicted by a model of visual search. Phys Med Biol. 2006;51:3463-3482.
10. Halpern SD, Karlawish JH, Berlin JA. The continuing unethical conduct of underpowered clinical trials. JAMA. 2002;288:358-362.
11. Black WC. Anatomic extent of disease: a critical variable in reports of diagnostic accuracy. Radiology. 2000;217:319-320.
12. Obuchowski NA, Mazzone PJ, Dachman AH. Bias, underestimation of risk, and loss of statistical power in patient-level analyses of lesion detection. Eur Radiol. 2010;20:584-594.
13. Dirac PAM. The Principles of Quantum Mechanics. New York, NY: Oxford University Press; 1981.
14. Chakraborty DP. Clinical relevance of the ROC and free-response paradigms for comparing imaging system efficacies. Radiat Prot Dosim. 2010;139:37-41.
15. Zanca F, Hillis SL, Claus F, et al. Correlation of free-response and receiver-operating-characteristic area-under-the-curve estimates: results from independently conducted FROC/ROC studies in mammography. Med Phys. 2012;39:5917-5929.
16. Firmino M, Angelo G, Morais H, Dantas MR, Valentim R. Computer-aided detection (CADe) and diagnosis (CADx) system for lung cancer with likelihood of malignancy. Biomed Eng Online. 2016;15:1-17.
17. Chakraborty DP, Zhai X. On the meaning of the weighted alternative free-response operating characteristic figure of merit. Med Phys. 2016;43:2548-2557.
18. Bunch PC, Hamilton JF, Sanderson GK, Simmons AH. A free-response approach to the measurement and characterization of radiographic-observer performance. J Appl Photogr Eng. 1978;4:166-171.