Title: A systematic review of validity evidence for checklists versus global rating scales in simulation-based assessment
Abstract: Medical Education, Volume 49, Issue 2, pp. 161-173 (Medical Education in Review). Jonathan S Ilgen (corresponding author; Division of Emergency Medicine, Department of Medicine, University of Washington School of Medicine, Harborview Medical Center, 325 9th Avenue, Box 359702, Seattle, Washington 98104-2499, USA; e-mail: [email protected]); Irene W Y Ma (Department of Medicine, University of Calgary, Calgary, Alberta, Canada); Rose Hatala (Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada); David A Cook (Mayo Clinic Multidisciplinary Simulation Center, Mayo Clinic College of Medicine, and Division of General Internal Medicine, Mayo Clinic, Rochester, Minnesota, USA). First published: 27 January 2015. https://doi.org/10.1111/medu.12621

Context: The relative advantages and disadvantages of checklists and global rating scales (GRSs) have long been debated. To compare the merits of these scale types, we conducted a systematic review of the validity evidence for checklists and GRSs in the context of simulation-based assessment of health professionals.

Methods: We conducted a systematic review of multiple databases, including MEDLINE, EMBASE and Scopus, to February 2013. We selected studies that used both a GRS and a checklist in the simulation-based assessment of health professionals. Reviewers working in duplicate evaluated five domains of validity evidence, including the correlation between scales and reliability. We collected information about raters, instrument characteristics, assessment context and task. We pooled reliability and correlation coefficients using random-effects meta-analysis.
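A minimal sketch of the kind of random-effects pooling the Methods describe, assuming the common approach of Fisher z-transforming correlations and estimating between-study variance with the DerSimonian-Laird method; the review does not publish its analysis code, and the study values below are hypothetical:

```python
import numpy as np

def pool_correlations(r, n):
    """Pool Pearson correlations r from studies of sample size n using a
    DerSimonian-Laird random-effects model on Fisher-z values (illustrative)."""
    r, n = np.asarray(r, float), np.asarray(n, float)
    z = np.arctanh(r)                 # Fisher z-transform stabilises variance
    v = 1.0 / (n - 3.0)               # within-study variance of z
    w = 1.0 / v                       # fixed-effect weights
    z_fe = np.sum(w * z) / np.sum(w)  # fixed-effect pooled z
    Q = np.sum(w * (z - z_fe) ** 2)   # Cochran's Q heterogeneity statistic
    C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(r) - 1)) / C)  # between-study variance estimate
    w_re = 1.0 / (v + tau2)                  # random-effects weights
    z_re = np.sum(w_re * z) / np.sum(w_re)
    se = 1.0 / np.sqrt(np.sum(w_re))
    # Back-transform the pooled z and its 95% CI to the correlation scale
    return np.tanh(z_re), (np.tanh(z_re - 1.96 * se), np.tanh(z_re + 1.96 * se))

# Hypothetical GRS-checklist correlations from three studies
pooled, (lo, hi) = pool_correlations(r=[0.70, 0.80, 0.75], n=[40, 60, 55])
print(f"pooled r = {pooled:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```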
Results: We found 45 studies that used a checklist and a GRS in simulation-based assessment. All studies included physicians or physicians in training; one study also included nurse anaesthetists. Topics of assessment included open and laparoscopic surgery (n = 22), endoscopy (n = 8), resuscitation (n = 7) and anaesthesiology (n = 4). The pooled GRS–checklist correlation was 0.76 (95% confidence interval [CI] 0.69–0.81, n = 16 studies). Inter-rater reliability was similar between scales (GRS 0.78, 95% CI 0.71–0.83, n = 23; checklist 0.81, 95% CI 0.75–0.85, n = 21), whereas GRS inter-item reliabilities (0.92, 95% CI 0.84–0.95, n = 6) and inter-station reliabilities (0.80, 95% CI 0.73–0.85, n = 10) were higher than those for checklists (0.66, 95% CI 0–0.84, n = 4 and 0.69, 95% CI 0.56–0.77, n = 10, respectively); an illustrative reliability computation appears after the Supporting Information list below. Content evidence for GRSs usually referenced previously reported instruments (n = 33), whereas content evidence for checklists usually described expert consensus (n = 26). Checklists and GRSs usually had similar evidence for relations to other variables.

Conclusions: Checklist inter-rater reliability and trainee discrimination were more favourable than suggested in earlier work, but each task requires a separate checklist. Compared with the checklist, the GRS has higher average inter-item and inter-station reliability, can be used across multiple tasks, and may better capture nuanced elements of expertise.

Supporting Information:
Figure S1. Meta-analysis of correlation coefficients between global rating scales and checklists.
Figure S2. Meta-analysis of inter-rater reliability of global rating scales and checklists.
Figure S3. Meta-analysis of inter-item reliability of global rating scales and checklists.
Figure S4. Meta-analysis of inter-station reliability of global rating scales and checklists.
Appendix S1. Inter-rater reliability for data abstraction by study investigators.
Appendix S2. Trial flow diagram.
Appendix S3. Methodological quality, as evaluated by the Medical Education Research Study Quality Instrument (MERSQI).
Appendix S4. Comparisons of validity evidence for relations to other variables between global rating scales (GRS) and checklists (CL).
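The inter-rater coefficients pooled in the Results are typically intraclass correlation coefficients (ICCs). As an illustration only, not drawn from the review or its included studies, the following sketch computes ICC(2,1) (the Shrout-Fleiss two-way random-effects, single-rater, absolute-agreement form) from ANOVA mean squares; the rating matrix is hypothetical:

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1) for an (examinees x raters) matrix of scores."""
    x = np.asarray(x, float)
    n, k = x.shape
    grand = x.mean()
    ms_rows = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)  # examinees
    ms_cols = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)  # raters
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0) + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))              # residual
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Two hypothetical raters scoring five examinees on a global rating scale
scores = [[3, 4], [5, 5], [2, 3], [4, 4], [1, 2]]
print(f"ICC(2,1) = {icc_2_1(scores):.2f}")
```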
Publication Year: 2015
Publication Date: 2015-01-27
Language: en
Type: review
Indexed In: Crossref, PubMed
Access and Citation
Cited By Count: 266