Taking Rater Exposure to Trainees Into Account When Explaining Rater Variability

Online Publication Date: 01 Dec 2016
Page Range: 726 – 730
DOI: 10.4300/JGME-D-16-00122.1

ABSTRACT

Background 

Rater-based judgments are widely used in graduate medical education to provide more meaningful assessments, despite concerns about rater reliability.

Objective 

We introduce a statistical modeling technique that corresponds to a new rater reliability framework, and we present a case example to illustrate the utility of this approach to assessing rater reliability.

Methods 

We used mixed-effects models to simultaneously incorporate random effects for raters and systematic effects of rater role as fixed effects. Study data are clinical performance ratings collected from medical school graduates who were evaluated for their readiness for supervised clinical practice in authentic simulation settings at 2 medical schools in the Netherlands and Germany.

Results 

The 2 medical schools recruited a maximum of 30 graduates each, out of 60 (50%) and 180 (17%) eligible candidates, respectively. Clinician raters (n = 25) for the study were selected based on their level of expertise and experience. Graduates were assessed on 7 facets of competence (FOCs) that are considered important in supervisors' entrustment decisions across the 5 cases used. Rater role was significantly associated with 2 FOCs: (1) teamwork and collegiality, and (2) verbal communication with colleagues/supervisors. For these 2 FOCs, rater variability was only partially explained by the role of the rater (a proxy for the amount of direct interaction with the trainee).

Conclusions 

Consideration of raters as meaningfully idiosyncratic provides a new framework to explore their influence on assessment scores, which goes beyond considering them as random sources of variability.

Introduction

Assessment of the clinical performance of learners and physicians in a real practice context is critically important, but issues of reliability and feasibility make it a challenging task. Although the utility and importance of alternative approaches to assessment, including workplace-based assessment, multisource feedback, and interprofessional teamwork assessments, are widely recognized, outstanding issues surrounding their reliability compromise the potential utility and adoption of these methods for summative assessment purposes.1,2

Recently, normalization of ratings data has been proposed as a way to reduce bias, but this approach still does not address how to best interpret and investigate the sources of these inconsistencies.3 In a recent review, Gingerich et al4 introduced 3 perspectives, 1 of which is "the assessor as meaningfully idiosyncratic." Specifically, rater perceptions of a trainee's performance are based on the outcomes of a complex interplay between the trainee, the rater, and the environment.5–7 Accordingly, the various observers interacting with residents will be privy to different sets of observations, depending on their role and interaction with the trainee. Recent studies by Govaerts et al8 and Gingerich et al4,9 provide greater insight into the underpinnings of rater behavior, showing that what appear to be idiosyncrasies may, on closer examination, reveal more systematic features of rater perception. In alignment with this view, we believe these idiosyncrasies should be accommodated and may reflect important differences.10–12

From a traditional psychometric view, idiosyncrasies in ratings reduce reliability and should be minimized through various methods (eg, rater training, consensus rating). Contrary to this view, we propose that, instead of trying to decrease the diversity of perspectives, reliability estimates should allow for this variability and encompass rater factors (ie, rater characteristics) to help explore and examine the idiosyncrasies. The inclusion of rater factors in the reliability estimate provides several advantages: (1) it allows for variability of raters without arbitrarily forcing the ratings to consensus; (2) it helps explain the sources of variability in the performance ratings; and (3) it may increase overall reliability. We propose using a mixed-effects model to incorporate both random effects for raters and systematic effects of rater characteristics (eg, rater role, rater experience, rater conditions) as fixed effects to gain more detailed information about rater variability.
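To make this proposal concrete, the minimal sketch below shows the long-format data layout such a mixed-effects analysis assumes: one row per trainee-rater observation, with the rater characteristic of interest (here, rater role) carried as a categorical covariate. The column names and values are illustrative only and are not taken from the study.

```python
import pandas as pd

# Hypothetical long-format ratings data: one row per trainee-rater observation.
# "role" records the rater characteristic to be modeled as a fixed effect
# (eg, supervisor, listener without direct contact, reporting-phase observer).
ratings = pd.DataFrame({
    "trainee": ["T01", "T01", "T01", "T02", "T02", "T02"],
    "rater":   ["R05", "R12", "R20", "R05", "R08", "R17"],
    "role":    ["supervisor", "listener", "observer",
                "observer", "supervisor", "listener"],
    "rating":  [4, 4, 3, 5, 4, 4],  # 5-point scale, 1 (weak) to 5 (very good)
})
print(ratings)
```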

In this article, we introduce a statistical modeling technique that corresponds to this new rater reliability framework, using a case example to illustrate the utility of the approach.

Methods

To explore the effect of including rater characteristics on the reliability estimates, we used clinical performance rating data collected from recent medical school graduates evaluated for their readiness for supervised clinical practice in authentic simulation settings in the Netherlands and Germany. A brief description of the study setting and assessment procedure is provided in the sections that follow; more detailed information is provided by Wijnen-Meijer et al.13

Setting

The graduates who had just completed undergraduate medical education at medical schools in Utrecht, the Netherlands, and Hamburg, Germany, participated in a simulated environment in the role of a beginning resident on a busy inpatient unit. The assessment consisted of 3 phases. First, graduates encountered 5 standardized patients (SPs) portraying patients with uncommon medical problems who had just been admitted to the hospital. In the second phase, after the patient encounters, graduates were given time to request lab results and gather additional information to determine differential diagnoses along with a management plan for each patient to present to the supervisor at the end of the day. During this phase, graduates also were given the opportunity to call their supervisors by phone if needed, and they also had a brief face-to-face meeting with the supervisor to discuss progress. During the third phase, graduates were given 30 minutes to present their differential diagnoses and management plans for the 5 SPs.

Participants

Each of the 2 schools recruited a maximum of 30 graduates, representing 30 of 60 (50%) and 30 of 180 (17%) eligible candidates, respectively. Clinician raters (n = 25) were selected to participate in the study based on their level of expertise and experience. Each graduate was assessed by 3 raters, and the raters had 3 distinct roles: (1) acting as the graduate's personal supervisor during the assessment; (2) being present for the entire simulation and listening to the telephone and face-to-face conversations between the supervisor and the graduate (but without direct contact with the graduate); and (3) observing the graduate only during the final reporting phase. The 3 role categories represent the rater roles typically encountered during graduate training and were included in the analysis as fixed effects. All 25 raters participated in each of the 3 rater role categories by random rotation.

Assessments

Raters were asked to rate overall performance on each of 7 facets of competence (FOCs) on a 5-point scale from 1 (weak) to 5 (very good); the FOCs are considered key components of supervisors' entrustment decisions about residents across the 5 SP cases. The FOCs inform the evaluation of entrustable professional activities in the larger study, which also follows the 5-level entrustable professional activity supervision scale.13 The 7 FOCs that were rated included (1) scientific and empirical grounded method of working; (2) knowing and maintaining one's own personal bounds and possibilities; (3) teamwork and collegiality; (4) verbal communication with colleagues and supervisors; (5) responsibility; (6) safety and risk management; and (7) active professional development. All 60 trainees were rated on the 7 FOCs by 3 raters representing 3 different levels of interaction with the trainee.

The Netherlands Association for Medical Education Ethical Review Board and the State of Hamburg Physicians Ethics Board provided ethical approval for the study.

Analysis

First, we estimated the variance components for rater effect by using a random-effects model as the baseline model. This represents the traditional approach to estimating reliability.

Step 1: Random-Effects Model

Yij = β0 + βi + βr(ij) + eij

where Yij = the observed rating; β0 = the average rating; βi = the trainee random effect; βr(ij) = the rater random effect; and eij = random errors. This provided information about the variance components associated with rater, trainee, and error. Then, we employed a mixed-effects model to include rater role (the amount and type of interaction with the trainee, as described in the Participants section) as a fixed effect in addition to the baseline random-effects model. The purpose of adding rater role as a fixed effect is to help explain the variability in raters and decrease the variance components related to raters (thus increasing reliability).

Step 2: Mixed-Effects Model With Rater Role as Fixed Effects

Yij = β0 + βi + βr(ij) + eij, with βr(ij) = γ0(rater role) + γr(ij)

where β0 = the average rating; βi = the trainee random effect; eij = random errors; βr(ij) = the rater effect; γ0 = the fixed effect of rater role; and γr(ij) = the random effect of rater.
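The 2 modeling steps can be sketched in Python with statsmodels (the authors used Stata; this is only an analogous illustration, not their code). Trainee and rater enter as crossed variance components by placing all observations in a single group, and rater role is added as a categorical fixed effect in Step 2. The data frame `ratings` and its column names follow the hypothetical layout sketched in the Introduction and assume enough observations to estimate the components.

```python
import statsmodels.formula.api as smf

# Place all observations in one group so trainee and rater can be modeled
# as crossed random effects via variance components.
ratings["whole_sample"] = 1
vc = {"trainee": "0 + C(trainee)", "rater": "0 + C(rater)"}

# Step 1: baseline random-effects model (random trainee and rater effects only).
m1 = smf.mixedlm("rating ~ 1", ratings, groups="whole_sample",
                 re_formula="0", vc_formula=vc).fit(reml=True)
print(m1.summary())  # variance components for trainee, rater, and residual error

# Step 2: same random effects plus rater role as a categorical fixed effect.
m2 = smf.mixedlm("rating ~ C(role)", ratings, groups="whole_sample",
                 re_formula="0", vc_formula=vc).fit(reml=True)
print(m2.summary())  # role coefficients plus updated variance components
```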

In Step 3, we used the estimates from the regression analysis and mixed-effects models to derive estimates of the variance components analogous to a generalizability study approach. In doing so, we expected an overall increase in reliability. A more detailed description of procedures for estimating variance components and reliability coefficients from regression estimates is provided by Shavelson and Webb.14 We used Stata version 13.0 (StataCorp LP, College Station, TX) for all statistical analyses.
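For Step 3, one common way to turn the estimated variance components into a generalizability-type reliability coefficient is to divide the trainee variance by the trainee variance plus the rater and residual variance averaged over the number of raters per trainee. The sketch below uses that formulation with made-up numbers, not values from the study.

```python
def g_coefficient(var_trainee, var_rater, var_error, n_raters):
    """Generalizability-type reliability for a design with n_raters raters per trainee."""
    return var_trainee / (var_trainee + (var_rater + var_error) / n_raters)

# Illustrative variance components only (not estimates from the study).
print(round(g_coefficient(var_trainee=0.40, var_rater=0.10,
                          var_error=0.35, n_raters=3), 2))  # ≈ 0.73
```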

Results

All trainees were rated by 3 raters, representing 3 different levels of interaction with the trainee, on the 7 FOCs using the 5-point scale described in the Methods.

Variance Components

The variance components associated with raters for the 7 FOCs ranged from 0% to 20%, indicating overall high rater reliability. As shown in the table, the 2 competency domains with the highest rater variability (low reliability) were teamwork and collegiality (domain 3, 20% of the total variance) and verbal communication with colleagues and supervisors (domain 4, 19% of the total variance).

Table. Variance Components for the 7 Ratings With and Without Rater Characteristics

Rater Category Effect

In the mixed-effects model, with the inclusion of the 3 rater role categories as a fixed effect, we examined the effect of rater characteristics on rater variability. Rater role was significantly associated with the same 2 FOCs that had the highest rater variability (teamwork and collegiality; verbal communication with colleagues and supervisors). In both instances, the third rater role (observing only the reporting phase) was associated with significantly lower ratings on teamwork and collegiality (β = −0.38, P = .049) and on verbal communication with colleagues and supervisors (β = −0.48, P = .021).

Differences in the Variance Components

For the 2 FOCs with the significant rater role effect, the variance components representing rater variability decreased. The variance component for the rater effect in the teamwork and collegiality domain was reduced from 20% to 18%, and for the verbal communication domain, the variance component for raters decreased from 19% to 15% after including rater role in the model. This translated into an increase of about 0.02 in overall reliability, from 0.78 to 0.80, for the verbal communication FOC.

Discussion

Conceptualizing raters as meaningfully idiosyncratic provides an alternative framework for exploring the role of raters in the interpretation of assessment scores, one that goes beyond considering them merely a random source of variability. With this framework, the focus shifts from consistency to understanding the sources of variability and the attributes of the raters. In this study, the overall reliability of the assessment increased slightly when rater role (a proxy for the amount of exposure to the trainee) was taken into account. This finding may add to the evidence that a minimum amount of exposure to the trainee is needed before feedback is given, to increase the overall meaningfulness of the rating.

Although the findings of the current study are limited by the available data, using the mixed-effects model may help explain some of the rater variability by taking systematic characteristics of the raters into account. Including these characteristics (eg, rater role, amount of contact with the trainee) in the analysis may increase overall rater reliability.

In addition, the perspective that raters are meaningfully idiosyncratic calls for examining variability rather than arbitrarily standardizing the ratings. If, for example, nurses and physicians provide consistently different ratings because of differences in their roles and environments, the reliability estimates need to be able to represent these differences appropriately. By including these characteristics or differences as part of the rater reliability analysis, we could provide various subscores representing these different perspectives, as sketched below. Identification of the key rater background characteristics and factors contributing to the differences in rater perception will be critical to the application of this new reliability framework. Recent studies should spark further discussion and development in this area.8,9,15
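As a loose illustration of such perspective-specific subscores, ratings from the hypothetical data frame used in the earlier sketches could simply be summarized separately by rater group:

```python
# Hypothetical subscores by rater group (eg, role or profession), using the
# long-format ratings frame sketched earlier.
subscores = ratings.groupby("role")["rating"].agg(["mean", "count"])
print(subscores)
```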

This study has several limitations. First, the data used for illustration purposes may not be representative of the typical observational ratings encountered in the workplace. However, despite being simulation based, our novel assessment format was developed to represent the complexity and unpredictability of the typical clinical setting. This was done to maximize the authenticity of the experience, as well as to simulate the typical situation of raters, who often have limited opportunities for direct observation. Second, the rater characteristic data available were limited to the raters' specific roles, which served as a proxy for the amount of contact with the trainee. Additional background information about the raters would have provided a richer example and probably more significant results. Also, given the high reliability of the ratings, it was difficult to illustrate the maximum potential utility of the method with the current data. Lastly, for illustration purposes, the example was kept purposely simple by excluding other facets, such as cases and items, from the model. In future studies, the issue of case specificity and the relationship between cases and rater characteristics should be explored in more detail.

Conclusion

As we move toward competency-based education with increased emphasis on work-based and interprofessional assessments, we will need a new framework for considering rater reliability. Our approach to rater reliability may provide ways to maximize the information derived from the variability in raters.

Author Notes

Corresponding author: Christy K. Boscardin, PhD, UCSF School of Medicine, Department of Medicine, Office of Medical Education, 533 Parnassus Avenue, Suite U-80, San Francisco, CA 94143-3202, 415.519.3570, christy.boscardin@ucsf.edu

Funding: The authors report no external funding source for this study.

Conflict of interest: The authors declare they have no competing interests.

The authors would like to thank Pat O'Sullivan, EdD, for the helpful suggestions and feedback on the earlier versions of this paper.

Received: 22 Feb 2016
Accepted: 27 Jul 2016