A Checklist to Help Faculty Assess ACGME Milestones in a Video-Recorded OSCE
ABSTRACT
Background
Faculty members need to assess resident performance using the Accreditation Council for Graduate Medical Education Milestones.
Objective
In this randomized study we used an objective structured clinical examination (OSCE) around the disclosure of an adverse event to determine whether use of a checklist improved the quality of milestone assessments by faculty.
Methods
In 2013, a total of 20 anesthesiology faculty members from 3 institutions were randomized to 2 groups to assess 5 videos of trainees demonstrating advancing levels of competency on the OSCE. One group used milestones alone, and the other used milestones plus a 13-item checklist with behavioral anchors based on ideal performance. We classified faculty ratings as either correct or incorrect with regard to the competency level demonstrated in each video, and then used logistic regression analysis to assess the effect of checklist use on the odds of correct classification.
Results
Thirteen of 20 faculty members rated assessing performance with milestones alone as difficult or very difficult. Checklist use was associated with significantly greater odds of correct classification for entry-level (odds ratio [OR] = 9.2, 95% confidence interval [CI] 4.0–21.2) and junior-level (OR = 2.7, 95% CI 1.3–5.7) performances. For performances at the other competency levels, checklist use did not affect the odds of correct classification.
Conclusions
A majority of anesthesiology faculty members reported difficulty assessing a video-recorded OSCE of error disclosure using milestones as the primary assessment tool. Use of the checklist aided correct assessments at the entry and junior levels.
Introduction
The implementation of milestone-based assessments by the Accreditation Council for Graduate Medical Education (ACGME) creates a need for residency programs to provide faculty members with training and tools to make these assessments.1 Each specialty has developed milestones or subcompetencies based on the 6 ACGME competencies for periodic assessment of trainee performance.2–4 Faculty members evaluate trainees' performance using the milestones, which now replace traditional global faculty assessments. Little is known about the manner in which faculty are trained to make milestone-based assessments, and whether use of milestone-based tools will improve the quality of faculty assessments.
We assessed whether use of a checklist would improve assessments of milestones by anesthesiology faculty at 3 institutions. We developed an objective structured clinical examination (OSCE) scenario around the disclosure of an adverse outcome to a standardized patient (SP), which is a patient care milestone in anesthesiology. Several residency programs use SPs for teaching this activity and have developed milestones for managing errors.5,6
Methods
In 2013, we e-mailed a description of the study to 20 faculty members from the Education and Clinical Competency Committees of the anesthesiology departments at the University of Alabama at Birmingham (UAB), Vanderbilt University, and Albany Medical Center.
Two authors (L.J.E. and M.L.W.) wrote the scenario for the OSCE: a resident is asked to make a postoperative visit to a female patient who experienced an adverse event (a loose tooth after a difficult intubation). In a 10-minute encounter, the resident must discuss the event, educate the patient about her difficult intubation, and counsel her for future surgery. The scenario is intended to allow the faculty member to assess 5 milestones in the competencies of patient care, professionalism, interpersonal and communication skills, practice-based learning and improvement, and systems-based practice. To establish content validity, faculty members at UAB and Vanderbilt University reviewed the scenario and provided feedback on the ideal observable behaviors, based on the literature and their institutions' protocols for managing medical errors, including disclosure to patients.7–9
We created an itemized checklist with behavioral anchors for each of 13 items, similar to checklists used in the UAB and Vanderbilt University simulation centers (provided as online supplemental material). Each item was rated as adequate, inadequate, or did not observe. We developed the milestone assessment tool around the 5 subcompetency milestones selected as the focus of the scenario (provided as online supplemental material).
We recorded five 10-minute videos of the OSCE scenario set at advancing levels of training and competency: entry (prior to first year of residency); junior (prior to subspecialty training); midlevel (subspecialty training); senior (ready to graduate); and advanced (aspirational). Three Vanderbilt trainees (a medical student, a resident, and a fellow) participated by performing the 5 roles. Trainees complied with the institutional consent process for creating videos. The same SP performed the role of the patient in all 5 videos.
We used a video capture system for medical simulation (B-line Medical, Washington, DC) and placed all 5 videos on a password-protected website that randomized the order of viewing.
Participants were randomized to 2 groups of 10 faculty each. Both groups used the milestone assessment tool, and 1 group (N = 10) used the checklist in addition. Each participant received assessment instructions and tools, a description of the OSCE scenario, and a survey. The faculty at 1 institution viewed the videos as a group and completed all 5 assessments before discussion; faculty at the other 2 institutions viewed and assessed the videos without group discussion. Participants completed an 8-question survey about their teaching experience, prior exposure to OSCEs and milestones, and ease of using the tools.
The Institutional Review Board at Vanderbilt University granted this project exempt status.
Statistical Analysis
All participants viewed and scored each of the 5 video performances. Each participant provided a score (entry, junior, mid, senior, or advanced) for each of the 5 milestones, an overall performance rating, and a rating of the required level of support. We classified video ratings for each milestone, as well as overall performance, as either correct or incorrect. We analyzed the data for all raters, videos, and milestones/competencies simultaneously using logistic regression to estimate the odds of correct classification, adjusting for milestone/competency, training level portrayed in the video, checklist use by the faculty rater, the interaction of portrayed training level and checklist use, and the interaction of milestone/competency and checklist use.
We used the interactions to assess whether the effect of checklist use varied by portrayed training level or by milestone/competency. For each portrayed training level, the odds ratio associated with checklist use is presented with a 95% confidence interval (CI). We used a likelihood ratio (LR) "chunk" test to assess the significance of explanatory variables and their interactions, and we omitted nonsignificant interactions from the final regression model. Using postassessment survey data, we constructed a 95% CI (Wilson score method) for the proportion of participants who felt that the checklist aided them in picking the appropriate milestone. We used Fleiss' kappa statistic to determine the degree of agreement for each of the 13 checklist items as a measure of interrater reliability.10
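For readers who wish to reproduce this type of analysis, a minimal sketch in Python is shown below, assuming the pandas, statsmodels, and scipy libraries. The variable names (correct, video_level, used_checklist, competency) and the simulated data are hypothetical; the article does not report the software or code actually used, and the sketch treats repeated ratings as independent for simplicity.

```python
# Hypothetical sketch of the analyses described above; variable names and the
# simulated data set are invented for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per rater x video x milestone rating; 'correct' = 1 if the assigned
# level matched the level portrayed in the video.
rng = np.random.default_rng(0)
levels = ["entry", "junior", "mid", "senior", "advanced"]
competencies = ["PC", "PROF", "ICS", "PBLI", "SBP"]
rows = [
    {"rater": r, "used_checklist": int(r < 10), "video_level": lvl,
     "competency": comp, "correct": int(rng.random() < 0.6)}
    for r in range(20) for lvl in levels for comp in competencies
]
df = pd.DataFrame(rows)

# Logistic regression: milestone/competency, portrayed level, checklist use,
# and the level x checklist interaction (nonsignificant interactions would be
# dropped from the final model).
full = smf.logit(
    "correct ~ C(competency) + C(video_level) * C(used_checklist)", data=df
).fit(disp=0)
reduced = smf.logit(
    "correct ~ C(competency) + C(video_level) + C(used_checklist)", data=df
).fit(disp=0)

# Likelihood ratio "chunk" test for the level x checklist interaction terms.
lr_stat = 2 * (full.llf - reduced.llf)
lr_df = full.df_model - reduced.df_model
print("LR test p =", chi2.sf(lr_stat, lr_df))

# Exponentiated coefficients give odds ratios and 95% CIs; level-specific ORs
# for checklist use combine the main-effect and interaction coefficients.
print(np.exp(full.params), np.exp(full.conf_int()))

# Wilson score 95% CI for a proportion, eg, raters who felt the checklist
# helped them pick the appropriate milestone.
print(proportion_confint(6, 10, alpha=0.05, method="wilson"))

# Fleiss' kappa for one checklist item: ratings is (n_videos x n_raters),
# coded 0 = adequate, 1 = inadequate, 2 = did not observe.
ratings = rng.integers(0, 3, size=(5, 10))
counts, _ = aggregate_raters(ratings)
print("Fleiss kappa:", fleiss_kappa(counts, method="fleiss"))
```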
Results
Twenty faculty members (18% of the 110 anesthesiology faculty at the 3 institutions) participated in the study. Five of the 20 previously had assessed a learner in an OSCE, 11 had been OSCE participants, and 7 had prior training in the use of milestones. When asked about the difficulty of using the milestones as a tool for OSCE assessment, 13 of 20 felt that it was difficult or very difficult. Among the 10 faculty members who used the checklist, only 4 found it difficult, and 6 of 10 (95% CI 3.1–8.3) felt that it aided them in picking the appropriate milestones (provided as online supplemental material).
Table 1 shows the counts of observed checklist items (adequate, inadequate, did not observe) for the 10 faculty members who used the checklist. As the portrayed level of performance increased, the number of observations rated as adequate increased (shown in boldface [and shading] in Table 1). Participants rated 2 items (recognizes when to involve/defer to supervisor and checks for understanding) more often as inadequate or did not observe. Interrater agreement (Table 2) was substantial for 3 items, moderate for 2 items, and fair or slight for 5 items, beyond the level of agreement expected due to chance; 3 items showed no agreement beyond chance.
We classified video ratings as either correct or incorrect according to the level of training portrayed in the video. Averaging across all video performances (ie, ignoring the interaction of checklist use and video training level), the odds of correct classification were greater by a factor of 1.4 (95% CI 1.0–2.0) when the checklist was used. However, there was significant evidence of an interaction between checklist use and video training level (LR test P value < .001); the effectiveness of checklist use was inversely related to the training level portrayed in the video, with the largest improvement in classification for the entry-level performance. Table 3 lists the odds ratios of correct classification associated with checklist use, stratified by video training level. For example, the odds of correctly classifying the entry-level video were increased by a factor of 9.2 (95% CI 4.0–21.2) with checklist use. The milestone category being rated (eg, patient care, professionalism) was not significantly associated with the odds of correct classification (LR test P value = .35), nor was there evidence of an interaction with checklist use (LR test P value = .68). Table 4 lists the percentage (count) of correct video ratings, stratified by training level and checklist use.
Despite randomization, there was some imbalance across study groups in years of experience and in prior experience assessing a learner in an OSCE. To address possible chance confounding by these factors, we performed a sensitivity analysis in which we additionally adjusted the model for years of experience (categorized) and prior OSCE assessment experience, as well as their interactions with checklist use. Although experience in OSCE assessment was positively associated with the odds of a correct video rating, the effect of checklist use was robust after adjustment for these factors.
Discussion
This study describes a faculty development exercise designed to compare assessment with the milestones alone against assessment with the milestones plus a checklist, the conventional OSCE tool. Although we assumed that faculty using an itemized checklist would choose the correct milestone more often, this was true only for the performances portrayed at the entry and junior levels; in all other cases, use of the checklist added no advantage.
Use of a checklist is the most common method to assess OSCE performance.11,12 Others have noted that it is more difficult to observe expertise using an OSCE, especially with a binary (adequate/inadequate) checklist,13,14 and there is evidence that global assessments or entrustable professional activities (EPAs) are preferable when working with more advanced learners.15–18 Videos of standardized performances by trainees have been used in other studies for setting standards, training faculty, and determining the reliability of assessments.19–21 Although we developed 5 videos for this study, in subsequent time-constrained faculty and resident training sessions we used only 1 junior and 1 advanced video for assessment and discussion, with positive results. The faculty participants in this study commented that viewing the video performances and assessing them with the tools provided was an effective introduction to the milestone concept and to performance assessment.
This study has limitations, including its small sample, which may reduce generalizability to faculty who did not participate, and the fact that the videos were not piloted in advance.
We will be creating new OSCE stations based on EPAs and on milestones that are difficult to assess, giving trainees and faculty live opportunities to practice and assess using the actual subcompetency milestones as the assessment tool. Videos of individual performances will be used for classroom teaching and standard setting. The goal of this research is to improve the quality of faculty assessment of trainees in actual clinical care.22,23
Conclusion
In this study, faculty members were able to assign milestones accurately to a video performance in most cases. A checklist aided the assessment of entry-level and junior-level performances. Global or EPA-based assessments may be more effective for more advanced trainees.
Author Notes
Funding: The authors report no external funding source for this study.
Conflict of interest: The authors declare they have no competing interests.
These results were presented at the Society for Education in Anesthesiology 29th Spring Meeting, Boston, Massachusetts, May 30–June 1, 2014, and at Gerald S. Gotterer Health Professions Education Research Day, Vanderbilt University Medical Center, Nashville, Tennessee, September 22, 2014.
The authors would like to thank Ms Martha Tanner for her editorial assistance and Dr Mark Rice for his guidance in preparing the manuscript.
Editor's Note: The online version of this article contains the faculty survey results, the disclosure objective structured clinical examination (OSCE) checklist, and the milestone/OSCE video evaluation tool.