The Reliability of Rater Variability

Andrea Gingerich, PhD
Online Publication Date: 01 Apr 2020
Page Range: 159 – 161
DOI: 10.4300/JGME-D-20-00163.1

Simulation is well recognized for its affordances for collecting important assessment information.1–3 In this issue of the Journal of Graduate Medical Education, Andler and colleagues present validity evidence for leveraging the simulation context to provide assessment data for entrustable professional activities (EPAs).4 Unfortunately, they found their validity argument hampered by an unexpected finding: despite good interrater reliability for entrustment-based simulation assessment ratings and fair interrater reliability for similar entrustment-based clinical practice ratings, there were no correlations between them. The authors ponder possible explanations for this troublesome finding and suggest that since there was only “fair agreement at best” for some of the behaviors, rater variability might be an explanation for the lack of correlations.

The havoc that rater variability has inflicted on reliability measures has spurred several of us to study its sources.5–7 Aspects not directly related to the rating scale, such as the context in which assessments take place8–10 and variations in rater interpretations and judgments,11–14 have been identified as contributors to rater variability. Thus, I am not surprised to see rater variability when an entrustment scale is used. In fact, as evidence of rater variability continues to accumulate along with increasing recognition of the “plurality of interpretations,”15 we may be reaching a point where rater variability can no longer be framed as an unexpected finding. Yet, this raises a conundrum for the assessment field. Accepting rater variability as the status quo would complicate plans for collecting and interpreting validity evidence.16 How can we demonstrate a relationship to other variables without reliability?

In part, the simulation context might offer a solution to this by providing a stable context where raters can be standardized and, themselves, judged. Almost 2 decades ago,17 medical educators were directed to techniques that optimize interrater reliability—figure skating judging.18,19 Although it is not free from bias,20 figure skating judging has design features that support rater agreement and interrater reliability. First, judges are trained and monitored so that those who share consensus are invited to continue judging and outlier judges are not. Second, the assessed performance lasts only a few minutes with a specified number of predictable elements that can be performed in a limited number of ways, with each variation assigned a corresponding score. Third, the assessment task is the judge's only task where they directly observe a series of similar performances. They assign ratings immediately after each assessment, and then note how their ratings compare with those of other judges. These design features are incompatible with almost every aspect of workplace-based assessment; however, the simulation context does offer similar affordances.21 Yet, I wonder how the design features that aim to minimize all types of unwanted variability would align with the very notion of entrustment-based assessment?

Entrustment, entrustability, and level of supervision scales promised to better mimic the judgments and decisions supervisors make in the workplace.22,23 The construct of entrustment resonated with the essence of supervision.24,25 It offered to systematically track subjective expert judgments of overall performance to complement the competence judgments based on observed behaviors that were already being collected and analyzed.26 I was excited about using entrustment as the basis for workplace-based assessment because it had the potential to capture indescribable and nuanced aspects of being a physician that resisted measurement.27 I am not an expert in simulation so I will pose the question to those who are: How well does entrustment align with what raters are doing, thinking, and feeling during simulation? It is not a straightforward question and leads to other difficult questions. What does it mean to entrust in simulation and how does it compare to entrusting in the workplace? For example, is the construct of entrustment most aligned when the rater is exposed to the competing priorities of patient safety, learner autonomy, clinical care, teaching obligations, service efficiency, and learner welfare? In other words, must the rater be simultaneously engaged with supervising the trainee for the construct of entrustment to be sufficiently aligned? If so, which forms of simulation offer that context for raters?

In proposing that entrustment can be used as the basis for assessment in simulation, the latest research of Andler and colleagues offers the opportunity to contemplate the ideal constructs for simulation assessment. If we were without contemporary pressures to provide data to inform EPA decisions, would we choose to use entrustment in this context? The assessment construct of feedback provision (like that used by field notes28) may be better aligned than entrustment if the rater's role in simulation is akin to that of a coach helping a trainee to learn during practice. Or perhaps the predictable and controllable conditions of simulation, similar to that of figure skating judging, could be used to optimize measurement of competence through standardized assessment of performance.

Entrustment-based assessment is rapidly becoming an important component of our assessment tool kit, but I cannot imagine a post-psychometric utopia where all assessments are based on entrustment. All of our assessment modalities (including EPAs), assessment constructs (including entrustment), and assessment contexts (including simulation) have strengths to be leveraged and limitations to be accommodated. Fortunately, the limitations of one can be strategically addressed by the strengths of another with its own limitations supported by yet another context or construct or modality.29 I am eager to see how the strengths of the simulation assessment context and the construct of entrustment can contribute to an assessment program that is more informative than the sum of its parts.

References

1. Amin Z, Boulet JR, Cook DA, Ellaway R, Fahal A, Kneebone R, et al. Technology-enabled assessment of health professions education: consensus statement and recommendations from the Ottawa 2010 conference. Med Teach. 2011;33(5):364–369. doi:10.3109/0142159X.2011.565832.
2. St-Onge C, Lineberry M. Simulation for assessment. In: Chiniara G, ed. Clinical Simulation. 2nd ed. Cambridge, MA: Academic Press; 2019:867–877.
3. Boulet JR. Summative assessment in medicine: the promise of simulation for high-stakes evaluation. Acad Emerg Med. 2008;15(11):1017–1024. doi:10.1111/j.1553-2712.2008.00228.x.
4. Andler C, Kowalek K, Boscardin C, van Schaik SM. E-ASSESS: creating an EPA assessment tool for structured simulated emergency scenarios. J Grad Med Educ. 2020;12(2):153–158.
5. Gingerich A, Kogan J, Yeates P, Govaerts M, Holmboe E. Seeing the ‘black box' differently: assessor cognition from three research perspectives. Med Educ. 2014;48(11):1055–1068. doi:10.1111/medu.12546.
6. Gauthier G, St-Onge C, Tavares W. Rater cognition: review and integration of research findings. Med Educ. 2016;50(5):511–522. doi:10.1111/medu.12973.
7. Lee V, Brain K, Martin J. Factors influencing Mini-CEX rater judgments and their practical implications: a systematic literature review. Acad Med. 2017;92(6):880–887. doi:10.1097/ACM.0000000000001537.
8. Gingerich A. Comparatively salient: examining the influence of preceding performances on assessors' focus and interpretations in written assessment comments. Adv Health Sci Educ Theory Pract. 2018;23(5):937–959. doi:10.1007/s10459-018-9841-2.
9. Lee V, Brain K, Martin J. From opening the ‘black box' to looking behind the curtain: cognition and context in assessor-based judgements. Adv Health Sci Educ Theory Pract. 2019;24(1):85–102. doi:10.1007/s10459-018-9851-0.
10. Kogan JR, Conforti L, Bernabeo E, Iobst W, Holmboe E. Opening the black box of clinical skills assessment via observation: a conceptual model. Med Educ. 2011;45(10):1048–1060. doi:10.1111/j.1365-2923.2011.04025.x.
11. Govaerts MJ, Van de Wiel MW, Schuwirth LW, Van der Vleuten CP, Muijtjens AM. Workplace-based assessment: raters' performance theories and constructs. Adv Health Sci Educ Theory Pract. 2013;18(3):375–396. doi:10.1007/s10459-012-9376-x.
12. St-Onge C, Chamberland M, Lévesque A, Varpio L. Expectations, observations, and the cognitive processes that bind them: expert assessment of examinee performance. Adv Health Sci Educ Theory Pract. 2016;21(3):627–642. doi:10.1007/s10459-015-9656-3.
13. Tavares W, Ginsburg S, Eva KW. Selecting and simplifying: rater performance and behavior when considering multiple competencies. Teach Learn Med. 2016;28(1):41–51. doi:10.1080/10401334.2015.1107489.
14. Yeates P, O'Neill P, Mann K, Eva K. Seeing the same thing differently: mechanisms that contribute to assessor differences in directly-observed performance assessments. Adv Health Sci Educ Theory Pract. 2013;18(3):325–341. doi:10.1007/s10459-012-9372-1.
15. Hodwitz K, Kuper A, Brydges R. Realizing one's own subjectivity: assessors' perceptions of the influence of training on their conduct of workplace-based assessments. Acad Med. 2019;94(12):1970–1979. doi:10.1097/ACM.0000000000002943.
16. Cook DA, Hatala R. Validation of educational assessments: a primer for simulation and beyond. Adv Simul (Lond). 2016;1:31. doi:10.1186/s41077-016-0033-y.
17. Williams RG, Klamen DA, McGaghie WC. Cognitive, social and environmental sources of bias in clinical performance ratings. Teach Learn Med. 2003;15(4):270–292. doi:10.1207/S15328015TLM1504_11.
18. Huang J, Foote CJ. Using generalizability theory to examine scoring reliability and variability of judging panels in skating competitions. J Quant Anal Sports. 2011;7(3). https://doi.org/10.2202/1559-0410.1241. Accessed February 24, 2020.
19. Weekley JA, Gier JA. Ceilings in the reliability and validity of performance ratings: the case of expert raters. Acad Manag J. 1989;32(1):213–222. doi:10.5465/256428.
20. Sala BR, Scott JT, Spriggs JF. The Cold War on ice: constructivism and the politics of Olympic figure skating judging. Perspect Politics. 2007;5(1):17–29. doi:10.1017/S153759270707003X.
21. Weersink K, Hall AK, Rich J, Szulewski A, Dagnone JD. Simulation versus real-world performance: a direct comparison of emergency medicine resident resuscitation entrustment scoring. Adv Simul (Lond). 2019;4(1):9. doi:10.1186/s41077-019-0099-4.
22. ten Cate O. Nuts and bolts of entrustable professional activities. J Grad Med Educ. 2013;5(1):157–158. doi:10.4300/JGME-D-12-00380.1.
23. Rekman J, Gofton W, Dudek N, Gofton T, Hamstra SJ. Entrustability scales: outlining their usefulness for competency-based clinical assessment. Acad Med. 2016;91(2):186–190. doi:10.1097/ACM.0000000000001045.
24. Pangaro L, ten Cate O. Frameworks for learner assessment in medicine: AMEE guide no. 78. Med Teach. 2013;35(6):e1197–e1210. doi:10.3109/0142159X.2013.788789.
25. Crossley J, Johnson G, Booth J, Wade W. Good questions, good answers: construct alignment improves the performance of workplace-based assessment scales. Med Educ. 2011;45(6):560–569. doi:10.1111/j.1365-2923.2010.03913.x.
26. ten Cate O, Scheele F. Competency-based postgraduate training: can we bridge the gap between theory and clinical practice? Acad Med. 2007;82(6):542–547. doi:10.1097/ACM.0b013e31805559c7.
27. Gingerich A. What if the ‘trust' in entrustable were a social judgement? Med Educ. 2015;49(8):750–752. doi:10.1111/medu.12772.
28. Ross S, Poth CN, Donoff M, Humphries P, Steiner I, Schipper S, et al. Competency-based achievement system: using formative feedback to teach and assess family medicine residents' skills. Can Fam Physician. 2011;57(9):e323–e330.
29. Van der Vleuten C, Schuwirth L, Driessen E, Dijkstra J, Tigelaar D, Baartman LK, et al. A model for programmatic assessment fit for purpose. Med Teach. 2012;34(3):205–214. doi:10.3109/0142159X.2012.652239.
Copyright: Accreditation Council for Graduate Medical Education 2020

Author Notes

Corresponding author: Andrea Gingerich, PhD, Northern Medical Program, 3333 University Way, Prince George, British Columbia V2N 4Z9, Canada, 250.960.5432, andrea.gingerich@unbc.ca