Program Effects as a Source of Construct-Irrelevant Variance in ACGME Milestone Ratings

Online Publication Date: 15 Aug 2025
Page Range: 434 – 438
DOI: 10.4300/JGME-D-24-00615.1

The Accreditation Council for Graduate Medical Education (ACGME) Milestone rating system was designed to (1) track residents’ development of competence during training and (2) help program directors make decisions about readiness for unsupervised practice.1 In psychometric terms, the Milestone rating system acts as a measurement tool, providing numerical estimates of competence as learners progress. It is therefore important to be aware of factors that can affect the validity of Milestone ratings.2

Since 2015, Milestone ratings have been submitted to the ACGME every 6 months, totaling approximately 3.8 million ratings.3 We have systematically conducted quantitative studies drawing on this dataset, using a variety of analytical models, as well as qualitative studies of the Clinical Competency Committee (CCC) processes involved in generating Milestones data.4-6 In doing so, we have identified specific challenges in the interpretation of Milestones data in some of the published literature. We have seen how programs differ widely in generating Milestones data, both within and across specialties and institutions, often reflecting differences in assessment culture and CCC processes rather than variations in true competence.7-11 Variance in Milestone ratings can arise from sources that are not directly relevant to learner competence, including rater bias and program culture. These sources are construct-irrelevant in the sense that they are usually specific to an individual program and can mask true differences in learner competence. An example of such masking is the “straight-lining” observed in some programs, in which a resident is assigned the same score across all Milestone subcompetencies.8 By examining the variation in national Milestones data through a widely used validity framework,12 we can systematically identify sources of construct-irrelevant variance and improve the accuracy and utility of interpretations made from these data. In this article, we focus on program effects.

Relevant and Irrelevant Sources of Variance in Milestones Data

Figure 1a illustrates how differences between learners can be masked by the “program effect” (these data are from obstetrics and gynecology and are intended only to illustrate the point; data from other specialties show similar patterns). In this figure, the distribution of Milestone ratings for individual residents within each program is overlaid on the distribution of program means, which are ordered from lowest to highest. For the most part, program means are quite similar, but comparing data from the ends of the distribution highlights the problem (see Figure 1b): residents rated highly within a program with lower means (eg, Programs 1-3) may nonetheless receive lower ratings than residents of comparable competence in other programs.

Figure 1a. Caterpillar Plot of Medical Knowledge Penultimate Milestone Ratings Across Programs

Figure 1b. Individual Resident Medical Knowledge Penultimate Milestone Rating for Top 3 and Bottom 3 Programs

Figure 2 illustrates differences between programs that are likely not based on differences in competence. In this figure, a strong trend from lower left to upper right would be expected if the certification examination (CE) and Medical Knowledge (MK) Milestone ratings were measuring the same underlying construct. However, as can be seen, there are programs whose MK Milestone ratings are substantially lower than expected given their CE scores (upper left quadrant) and programs whose Milestone ratings are substantially higher than expected given their CE scores (lower right quadrant). The former programs can be thought of as “stringent” in assigning Milestone ratings, while the latter can be considered “lenient.”

Figure 2. Mean Program-Level Medical Knowledge Penultimate Milestone Rating vs Program-Level Mean Certification Examination Scores for National Data in Obstetrics and Gynecology
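
To make the notions of stringency and leniency concrete, the sketch below shows one way such a program-level comparison could be set up: regress mean MK Milestone ratings on mean CE scores and inspect the residuals. This is a minimal sketch, not the analysis used to generate Figure 2; the column names, example values, and the 0.2 residual cutoff are illustrative assumptions.

```python
# Minimal sketch: flagging programs whose mean MK Milestone ratings are much
# lower ("stringent") or higher ("lenient") than a simple linear trend on
# mean CE scores would predict. All names, values, and cutoffs are
# illustrative assumptions, not values from the national dataset.
import numpy as np
import pandas as pd

programs = pd.DataFrame({
    "program_id": ["P1", "P2", "P3", "P4", "P5"],
    "mean_mk_rating": [3.2, 4.1, 3.8, 4.5, 3.5],
    "mean_ce_score": [215, 205, 230, 210, 225],
})

# Fit a simple linear trend of MK rating on CE score.
slope, intercept = np.polyfit(programs["mean_ce_score"], programs["mean_mk_rating"], 1)
programs["expected_mk"] = intercept + slope * programs["mean_ce_score"]

# Residuals: large negative values suggest relative stringency, large
# positive values relative leniency, given CE performance.
programs["residual"] = programs["mean_mk_rating"] - programs["expected_mk"]
programs["tendency"] = np.where(
    programs["residual"] < -0.2, "stringent",
    np.where(programs["residual"] > 0.2, "lenient", "as expected"),
)
print(programs)
```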

Additional empirical evidence for program effects comes from both quantitative13 and qualitative studies, including surveys and structured interview studies with program directors.5,6,14,15

How to Address Program Effects—Implications for Future Research

Both quantitative and qualitative approaches may be useful in addressing program effects as potential confounders in any future research on Milestone ratings at the national level.

For the quantitative approach, we suggest using a group-centering approach,16,17 as outlined in a recent national-level study of Milestone ratings in vascular surgery.18 This involves subtracting the residency program mean from each trainee’s Milestone rating, which helps to control for program-level effects and potential bias in the group rating process.
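
As a concrete illustration, the minimal sketch below applies group-mean centering to a small, made-up table of ratings. The column names and values are assumptions for illustration only and are not drawn from the cited vascular surgery study.

```python
# Minimal sketch of group-mean centering for Milestone ratings, assuming a
# tidy table with one row per resident. Column names and values are
# illustrative placeholders.
import pandas as pd

ratings = pd.DataFrame({
    "program_id": ["A", "A", "A", "B", "B", "B"],
    "resident_id": [1, 2, 3, 4, 5, 6],
    "mk_rating": [3.0, 3.5, 4.0, 2.5, 3.0, 3.5],
})

# Subtract each program's mean rating from the individual rating, so the
# remaining variation is within-program rather than between-program.
program_mean = ratings.groupby("program_id")["mk_rating"].transform("mean")
ratings["mk_rating_centered"] = ratings["mk_rating"] - program_mean

print(ratings)
```

The centered ratings can then be carried into downstream models in place of the raw ratings, so that between-program differences in rating stringency do not masquerade as differences in learner competence.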

Another approach to investigating the nature of program effects could involve selecting programs at each end of the caterpillar plot in Figure 1 and interviewing them. Programs with high mean Milestone ratings could be examined for tendencies toward leniency in their rating practices. For example, if paired programs display unequal mean Milestone ratings but are equivalent in basic characteristics (eg, program size, geographic region, faculty ratio, gender ratio) or CE results, this may indicate that construct-irrelevant factors are at play. Subsequent interviews or ethnographic observation and reflection may help uncover why the observed differences are present.19
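
The sketch below illustrates one way such a sampling-and-pairing step might look before the qualitative follow-up. The program characteristics, the matching variable (program size), and the choice of 3 programs per end are illustrative assumptions only.

```python
# Minimal sketch: select programs from each end of the caterpillar plot and
# pair them on a basic characteristic before interviews. All data, column
# names, and cutoffs are illustrative placeholders.
import pandas as pd

program_summary = pd.DataFrame({
    "program_id": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "mean_mk_rating": [3.1, 3.2, 3.3, 4.4, 4.5, 4.6],
    "program_size": [12, 24, 18, 14, 22, 20],
    "mean_ce_score": [220, 218, 222, 219, 221, 217],
})

ranked = program_summary.sort_values("mean_mk_rating")
bottom3 = ranked.head(3)  # candidate "stringent" programs
top3 = ranked.tail(3)     # candidate "lenient" programs

# Pair each low-mean program with the high-mean program closest in size;
# unequal Milestone means despite similar characteristics and CE results
# would suggest construct-irrelevant factors worth exploring in interviews.
pairs = []
for _, low in bottom3.iterrows():
    match = top3.loc[(top3["program_size"] - low["program_size"]).abs().idxmin()]
    pairs.append((low["program_id"], match["program_id"]))
print(pairs)
```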

For the qualitative approach, we recommend utilizing semistructured interviews and ethnography to determine the impact of program-specific cultural and structural factors on the rating process. While individual faculty may vary in their approach to assessment and employ different thresholds for rating trainees, there are also important group processes at play within each CCC. Specifically, the qualitative work should focus on: (1) CCC processes; (2) program culture (eg, feasibility and acceptability of the Milestone rating process); (3) framing of evaluations for summative vs formative purposes; and (4) identification of reflective practices for improvement of the rating process (ie, continuous quality improvement).

Implications for the Program Director

For the program director, ideally, Milestone ratings should be “criterion-referenced” (ie, aligned with the wording of each Milestone paragraph). If program directors are confident in the validity of their local Milestone ratings, they can use the published predictive probability values (PPVs) from national data for their specialty to identify learners who are struggling early in training.20 If necessary, they can then intervene by providing more focused practice and/or skill development, or by assigning the learner to a supplementary rotation to gain more experience with the skills they find difficult. More details on the development and rationale of PPV data can be found on the ACGME website and in published work.21-26
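
As a rough illustration of how published PPVs might be operationalized locally, the sketch below flags learners whose predicted probability of reaching the graduation target falls below a chosen threshold. The table structure, probability values, and threshold here are placeholders; actual PPV tables should be taken from the published ACGME resources for the relevant specialty and subcompetency.

```python
# Minimal sketch: flag learners for early intervention using a hypothetical
# PPV lookup table. The ratings, probabilities, and threshold below are
# made-up placeholders, not published ACGME values.
HYPOTHETICAL_PPV = {
    # mid-training MK rating -> assumed probability of reaching the
    # recommended graduation target (illustrative values only)
    1.0: 0.35,
    1.5: 0.50,
    2.0: 0.70,
    2.5: 0.85,
    3.0: 0.95,
}

def flag_learners(ratings: dict[str, float], threshold: float = 0.80) -> list[str]:
    """Return resident IDs whose assumed probability of reaching the
    graduation target falls below the (illustrative) threshold."""
    flagged = []
    for resident_id, rating in ratings.items():
        probability = HYPOTHETICAL_PPV.get(rating, 1.0)
        if probability < threshold:
            flagged.append(resident_id)
    return flagged

print(flag_learners({"res_01": 2.0, "res_02": 3.0, "res_03": 1.5}))
```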

In practice, we realize that not all CCC members will always refer to the Milestones language when making judgments about residents’ competence. At the individual program level, a group-centering approach would yield very little additional information. Perhaps the most effective way to reduce the construct-irrelevant variance that arises from differences between programs would be to ask CCCs to adhere as closely as possible to the Milestones language when assigning ratings, with the hope that other programs are doing the same. In the parlance of validity theory, this would be an example of using content validity to enhance response process validity.12

Discussion

Assessment of physician competence during graduate medical education is of paramount importance to ensure graduates are prepared to provide safe patient care in the unsupervised environment.27 A deeper understanding of construct-irrelevant variance, such as program effects, can help apportion variation in Milestone ratings between program-level factors and individual learner competence, thus enhancing the validity of the data. This in turn makes the available tools for intervention (eg, PPVs) more valuable.

The implications for graduate medical education are significant: when supported by validity evidence, Milestone ratings can be used to predict future adverse patient outcomes18 and identify trainees in need of remediation while still in the supervised environment.

A large and comprehensive analysis by Asch et al28 demonstrated that program structure and culture have an effect on future practice patterns of individual graduates. However, it is also clear that individuals vary in their progress toward competency during training. Milestone ratings—when adjusted for program effects—can go beyond the Asch study and also predict an individual’s future patient interactions.29 This type of data can provide program directors with tools to help struggling learners early in their training.20-26

We suggest that analyses that do not explicitly incorporate program effects are inherently limited, and this article provides some suggestions for taking these differences into account.

Copyright: 2025
Author Notes

Corresponding author: Stanley J. Hamstra, PhD, University of Toronto, Toronto, Ontario, Canada, stan.hamstra@utoronto.ca