Passing a Technical Skills Examination in the First Year of Surgical Residency Can Predict Future Performance
ABSTRACT
Background
The ability of an assessment to predict performance would be of major benefit to residency programs, allowing for early identification of residents at risk.
Objective
We sought to establish whether passing the Objective Structured Assessment of Technical Skills (OSATS) examination in postgraduate year 1 (PGY-1) predicts future performance.
Methods
Between 2002 and 2012, 133 PGY-1 surgery residents at the University of Toronto (Toronto, Ontario, Canada) completed an 8-station, simulated OSATS examination as a component of training. With recently set passing scores, residents were assigned a pass/fail status using 3 standard-setting methods (contrasting groups, borderline group, and borderline regression). Future in-training performance was compared between residents who had passed and those who had failed the OSATS, using in-training evaluation reports from resident files. A Mann-Whitney U test compared performance between groups at PGY-2 and PGY-4 levels.
Results
Residents who passed the OSATS examination outperformed those who failed, when compared during PGY-2 across all 3 standard setting methodologies (P < .05). During PGY-4, only the contrasting groups method showed a significant difference (P < .05).
Conclusions
We found that PGY-1 surgical resident pass/fail status on a technical skills examination was associated with future performance on in-training evaluation reports in later years. This provides validity evidence for the current PGY-1 pass/fail score, and suggests that this technical skills examination may be used to predict performance and to identify residents who require remediation.
Introduction
Competency-based surgical education is gaining momentum around the world because it aims to ensure that surgeons achieve the necessary skills to provide safe patient care.1–4 The ability to predict competence would have major implications for resident selection, promotion, and certification.5
While a surgeon is expected to achieve competence in several domains, technical skills remain a key component for surgical specialties. Simulated environments have been used for technical skills training and have demonstrated transferability of skills to the operating room.6,7 However, simulated performance data to date have not been used to predict performance. Among the tools to assess technical skills,8 one of the most widely used is the Objective Structured Assessment of Technical Skills (OSATS),9 which has been implemented across a variety of specialties.9–12 One of the limitations of the original OSATS examination was its lack of a pass score, limiting its use in pass/fail decisions.13 Furthermore, there were no data, to our knowledge, investigating the predictive ability of this examination. Recently, pass scores have been set for the original OSATS examination, allowing residents to be assigned a pass/fail status.14 That study used data from 513 postgraduate year 1 (PGY-1) surgical residents collected over a 10-year period to set the pass score for the OSATS examination with 3 standard setting methodologies (contrasting groups [CG], borderline group [BG], and borderline regression [BR]).14
One way to build further validity evidence for the OSATS examination is to demonstrate the predictive ability of the recently set OSATS pass score.9,15 If passing or failing the OSATS examination predicts future residency performance, it not only builds validity evidence for the pass scores but also, from a practical standpoint, it could help in the early identification and remediation of underperforming trainees.
To that end, the purpose of this study was to build evidence of validity for the recently set OSATS passing scores, hypothesizing that passing the OSATS examination predicts better future technical skills performance among surgical residents.
Methods
The University of Toronto (Ontario, Canada) has administered the OSATS examination to all PGY-1 surgical residents since the early 2000s. Data have been collected from all surgical residents who have taken this 8-station, simulation-based, technical skills examination since its initiation. Only raw scores had been assigned, because a passing score was not set until recently.
A recent study used this database to set passing scores for the OSATS examination with 3 standard setting methodologies: the CG method, the BG method, and the BR method.14 Passing scores were then used to retrospectively assign a pass/fail status to all general surgery residents (N = 133) who had taken the OSATS examination between 2002 and 2012.14
The current study used the pass/fail status of the 133 surgery residents to compare future residency performance between those who had passed and those who failed the OSATS. Future performance was assessed using retrospectively collected, in-training evaluation reports (ITERs) from residents' training files, capturing data from their PGY-2 and PGY-4. The ITER data were collected from all surgical rotations and completed by multiple raters.
While the ITERs include data on multiple domains of competence, our study used only data specific to technical skills, with items rated on a 5-point Likert scale. A technical skills score was established for each resident for PGY-2 and for PGY-4 by calculating the mean score (out of 5) across all of the technical skills items on that resident's ITERs from the corresponding year.
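As a rough illustration of the scoring described above, the following Python sketch pools every technical skills item rating from one resident's ITERs for a given year and takes the mean. The function name, the list-of-lists input layout, and the example ratings are hypothetical; the sketch also assumes that items are pooled across ITERs rather than averaged per ITER first, which the text does not specify.

```python
from statistics import mean


def technical_skills_score(iters):
    """Mean of all technical skills item ratings (1-5) for one resident-year.

    `iters` is a list of ITERs, each a list of item ratings.
    Assumption: ratings are pooled across ITERs before averaging.
    """
    ratings = [rating for report in iters for rating in report]
    if not ratings:
        return None  # no ITER data available for that year
    if any(not 1 <= rating <= 5 for rating in ratings):
        raise ValueError("ratings must lie on the 5-point scale")
    return mean(ratings)


# Hypothetical resident with 2 PGY-2 ITERs of 3 technical skills items each
pgy2_iters = [[4, 4, 5], [3, 4, 4]]
pgy2_score = technical_skills_score(pgy2_iters)
print(pgy2_score)
```

Returning `None` for a resident without ITERs mirrors the fact that scores were available for only a subset of residents at each level.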
A Mann-Whitney U test compared the technical skills score during PGY-2 and PGY-4 between residents who passed and residents who failed the OSATS using the 3 standard setting methodologies.
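For readers unfamiliar with the test, a minimal Python sketch of a two-sided Mann-Whitney U comparison is given below. It uses the large-sample normal approximation with a tie correction and no continuity correction; these are assumptions of the sketch, as the study does not state which variant or software was used. The function name and the example scores are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, z, two-sided P value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1..j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        raise ValueError("all observations are tied; z is undefined")
    z = (u_x - mu) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided P from the normal tail
    return u_x, z, p


# Hypothetical technical skills scores (not study data)
passed = [4.2, 4.0, 4.5, 4.1, 3.9]
failed = [3.8, 3.7, 4.0, 3.6]
u, z, p = mann_whitney_u(passed, failed)
print(f"U = {u}, z = {z:.2f}, P = {p:.3f}")
```

A positive z here means the first group (residents who passed) tends to have higher scores. With samples as small as the example, an exact test would normally be preferred over the normal approximation.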
The Research Ethics Board at St Michael's Hospital (Toronto, Ontario, Canada) approved this study.
Results
Data from the ITERs were available on 109 PGY-2s and 76 PGY-4s. The Kolmogorov-Smirnov test of normality demonstrated a deviation from normality (P < .05); therefore, the nonparametric Mann-Whitney U test was used.
The majority of PGY-2s (n = 63, 58%) had data from 2 ITERs (range, 1–3) and the majority of PGY-4s (n = 63, 83%) had 2 or 3 ITERs (range, 1–4). Each ITER contributed multiple data points for calculating a PGY-2 and PGY-4 technical skills score, respectively.
At PGY-2, a statistically significant difference was seen between residents who passed and those who failed the OSATS, according to all 3 standard-setting methods (CG, BG, BR). Those who passed outperformed those who failed (Mann-Whitney U test; CG, z = 3.49, P < .001; BG, z = 2.50, P = .012; BR, z = 2.09, P = .037; Table 1). At the PGY-4 level, the difference remained statistically significant only with the CG method (Mann-Whitney U test; z = 2.58, P = .010; Table 2; Figure).
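The reported z statistics and P values can be cross-checked with a short Python sketch that converts each z into a two-sided P value under the standard normal distribution. The function name and dictionary labels are illustrative, and the sketch assumes the reported P values are two-sided. Because the published z values are rounded to 2 decimals, the recomputed values are approximate.

```python
import math


def two_sided_p(z):
    """Two-sided P value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# z statistics as reported in the Results
reported_z = {
    "PGY-2, CG": 3.49,
    "PGY-2, BG": 2.50,
    "PGY-2, BR": 2.09,
    "PGY-4, CG": 2.58,
}

for label, z in reported_z.items():
    print(f"{label}: z = {z:.2f}, two-sided P = {two_sided_p(z):.4f}")
```

Under this assumption, the recomputed values agree with the reported P values to 3 decimals.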



Citation: Journal of Graduate Medical Education 9, 3; 10.4300/JGME-D-16-00517.1
Discussion
This study demonstrates that PGY-1 residents' pass/fail status on the OSATS has the potential to predict future performance, with failing residents being more likely to underperform based on ITER data during their PGY-2. As time passes, the ability to predict performance becomes more difficult, as more variables influence outcomes; despite that, the pass/fail status using the CG methodology continued to predict performance in PGY-4, showing a statistically significant difference between groups. The loss of statistical significance in PGY-4 for the BG and BR methods does not discount them as useful or credible standard-setting methods; rather, this study highlights the limitation of distant prediction and the need for continuous assessment throughout training.
Progression within a surgery program often relies on ITER evaluations, which are poor at identifying residents with below-average technical skills.16 Implementing an objective assessment of technical skills early in surgical training may be instrumental in identifying underperformers and introducing early educational interventions for effective remediation, and could help to address the "failure-to-fail" phenomenon.17,18
The OSATS examination, originally developed as a technical skills assessment,19 was used in the present study to investigate the predictive ability of an objective, standardized, performance-based assessment. Although the focus was on surgery trainees, the results could be of interest to a broader surgical audience, as the OSATS has been widely adopted across other surgical specialties8,13,20,21 and anesthesiology.22 Furthermore, the OSATS, as a performance-based assessment, parallels the objective structured clinical examination,23 which assesses clinical skills and has been used in nontechnical specialties, including internal medicine24,25 and family medicine.26 This study therefore provides foundational work for further predictive studies in other technical and nontechnical specialties.
Previous reports suggested ITERs are poor at identifying residents with below-average technical skills.16 In contrast, our results suggest that ITER scores can discriminate resident technical performance. We found that a failing score on the OSATS in PGY-1 was associated with significantly poorer technical skills in PGY-2, based on ITER data. This difference was maintained in PGY-4 using the CG methodology. The absolute difference in median ITER scores, however, was small. The median ITER scores for failing residents ranged from 3.83 to 3.94 in PGY-2 and from 3.67 to 3.78 in PGY-4. In contrast, the median ITER scores for passing residents ranged from 4.17 to 4.22 in PGY-2 and from 4.00 to 4.10 in PGY-4. This suggests that a score of 3 (scale midpoint), with a descriptor of competent, may be overestimating performance at that level. This rightward shift of the assessment scale is consistent with the existing literature that ITER data are typically heavily biased toward competent. Despite that bias, the present study was still able to show a difference in ITER scores between groups. Given that ITER evaluations are already well established in many training programs, it is important to recognize this upward shift when interpreting an individual resident's ITER.
In contrast to the ITER, the OSATS has accrued a wealth of validity evidence for the interpretation of its scores.8,13,19 However, its use in high-stakes decisions has been limited due to the lack of an established passing score.13 Setting pass scores and investigating the effect of pass/fail status addresses the “implication or decisions” component of the Kane27 model of validity. This domain of validity has been neglected in the OSATS validation literature and is an essential component if the OSATS is to be considered for high-stakes decisions.9 Until recently, few studies had addressed pass/fail scores for OSATS-type examinations, and those that did typically based the pass/fail decision on an overall dichotomous judgment rather than on standard-setting methodologies.12,28,29 Moreover, no study, to our knowledge, has looked at the implications of OSATS pass/fail results.9 The present study builds on the implication or decisions validity argument by demonstrating the predictive ability of the OSATS pass/fail status on future performance; this builds validity evidence not only for the OSATS but also for the recently set pass scores. This component of validity is also essential for considering the use of OSATS in high-stakes assessments, such as promotion or certification.20
The use of technical skills simulation to assess and predict future performance in the workplace is a relatively new concept. Traditionally, simulation has been used as an adjunct to teach technical skills, flattening the learning curve inside the operating room, with studies demonstrating the transfer of skills acquired in the laboratory to the operating room.6,7 However, data on simulation to assess and predict performance are limited. Moore et al30 used a simulated technical skills assessment during residency selection to predict performance during residency and demonstrated a moderate correlation, but they did not use a dichotomous pass/fail status, limiting the ability to identify a failing cohort that would be at risk of future difficulties. The advantage of the present study is its ability to dichotomize the group into passing and failing cohorts using evidence-based passing scores, allowing for the identification of the group that would benefit from early remediation.
This study has 2 limitations. One is the use of ITER data, which have been criticized for being poor at identifying below-average residents.16 However, while the reliability of the ITER can be low with a single rater and a single evaluation, aggregated ITER data (as used in our study) with multiple evaluators and across multiple rotations have been shown to have good reliability and predictive validity.31–33 The second limitation is its retrospective nature. Future research will explore the ability of a pass/fail status to predict intraoperative performance and patient outcomes. Further work will also include the development of remedial strategies for underperforming residents.
Conclusion
This study demonstrated the ability of a simulated performance-based assessment to predict future skills. A key implication of these findings is the potential for early identification and remediation of the underperforming resident.

Figure. Comparing Technical Skills Scores at Postgraduate Year 2 (PGY-2) and PGY-4 Levels Between Residents Who Passed and Failed the OSATS Examination During PGY-1
Note: Pass/fail status determined with (a) a contrasting groups method; (b) a borderline group method; and (c) a borderline regression method.
Author Notes
Funding: This research was partly funded by the Society for Surgery of the Alimentary Tract Career Development Award for Clinical/Outcomes/Education Research.
Conflict of interest: The authors declare they have no competing interests.
These results were presented at the American College of Surgeons Clinical Congress, Chicago, Illinois, October 2015. These data were previously published as part of that meeting, as an abstract in the Journal of the American College of Surgeons: de Montbrun S, Grantcharov T. Passing the Objective Structured Assessment of Technical Skills (OSATS) examination predicts future technical skills performance in surgical trainees. J Am Coll Surg. 2015;221(suppl 4):53–54.



