News Story
AI Model in Pediatric Radiology Exhibits Age, Sex, Maturity, and Race Biases, Univ. of Maryland Research Shows
Researchers from the University of Maryland School of Medicine (UMSOM) and the Robert E. Fischell Institute for Biomedical Devices recently showed that a state-of-the-art artificial intelligence (AI) model used to measure bone age in pediatric populations exhibited sex-, age-, sexual maturity-, and race-based biases when applied to diverse populations. The research team, led by University of Maryland Medical Intelligent Imaging (UM2ii) Center Director and Fischell Institute faculty member Paul Yi, M.D., published their findings in Radiology.
“We evaluated an award-winning bone age AI model on a diverse dataset of children, which was not previously performed,” Yi said. “By comparing rates of error between demographic groups, we found clinically significant biases disadvantaging historically underrepresented groups. This is concerning because it demonstrates the potential for AI to perpetuate pre-existing health disparities.”
Bone age is a measure of a pediatric patient’s skeletal maturity. Clinicians use this index not only to estimate a child’s adult height or the timing of puberty onset, but also to help identify whether a child may have a hormonal imbalance, endocrine disease, or another potential health challenge.
Doctors typically determine a child’s bone age by taking an X-ray of the child’s wrist, hand, or fingers. They then compare the patient’s images with an atlas of X-ray images long considered representative of specific ages and sexes, visually searching for the closest match. If the patient’s images most closely match those of the corresponding sex and age group, the patient is considered on track for skeletal maturity. If they most closely match what is considered standard for a younger or older child of the same sex, the patient may be considered to have delayed or early bone maturation, respectively.
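In essence, the matching step is a nearest-neighbor search over the atlas. The following Python sketch mimics that logic for illustration only; the feature vectors, distance score, and 12-month tolerance are hypothetical stand-ins for the radiologist’s visual judgment, not part of the actual clinical workflow.

# Minimal sketch of the atlas-matching logic described above.
# All inputs and thresholds here are hypothetical illustrations.

def estimate_bone_age(patient_features, atlas):
    """Return the bone age (in months) of the atlas entry whose
    reference image best matches the patient's radiograph.

    atlas: list of (bone_age_months, reference_features) pairs
           for the patient's sex.
    """
    def score(a, b):
        # Squared-difference score standing in for the
        # radiologist's visual side-by-side comparison.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best_age, _ = min(atlas, key=lambda e: score(patient_features, e[1]))
    return best_age

def classify_maturation(bone_age_months, chronological_age_months,
                        tolerance_months=12):
    # Flag delayed or advanced maturation when bone age differs
    # from chronological age by more than the (assumed) tolerance.
    delta = bone_age_months - chronological_age_months
    if delta > tolerance_months:
        return "advanced"
    if delta < -tolerance_months:
        return "delayed"
    return "on track"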
This process, however, is tedious and at least somewhat subjective.
Given this, in 2017 the Radiological Society of North America hosted a Bone Age Algorithm Challenge that enlisted teams of radiologists, data scientists, and other collaborators to develop an algorithm that could automate the bone age assessment process, either replacing or assisting the radiologist’s visual side-by-side comparison. The winning team’s algorithm demonstrated exemplary performance when tested on the challenge dataset and compared with the interpretations of three pediatric radiologists.
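Models of this kind are typically convolutional neural networks trained to regress bone age directly from the radiograph. The PyTorch sketch below shows the general shape of such a model; the architecture, input size, and names are assumptions for demonstration and do not reproduce the challenge-winning model.

# Illustrative sketch of a deep learning bone age regressor.
# Architecture and input size are assumed for demonstration only.
import torch
import torch.nn as nn

class BoneAgeRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        # Small convolutional backbone over a 1-channel hand X-ray.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        # Regression head predicts bone age in months.
        self.head = nn.Linear(32, 1)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.head(x)

model = BoneAgeRegressor()
dummy_xray = torch.randn(1, 1, 256, 256)  # batch of one radiograph
predicted_age_months = model(dummy_xray)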
Yi and his group verified this strong overall performance themselves. But when the UMSOM group tested the algorithm’s performance in specific demographic groups, they found cause for concern.
“We saw that the algorithm actually presents a higher rate of clinically significant errors in the female group, as well as in kids who are either very young – essentially, newborns – or nearing adulthood,” Yi said. He and his team also noted that the algorithm presented borderline statistically significant differences in errors in Black and Hispanic pediatric populations (20 percent and 17 percent, respectively) compared with the rate of errors in White pediatric populations (14 percent).
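The sketch below illustrates how such a subgroup comparison might be set up: per-group rates of clinically significant error, followed by a significance test. The counts and the choice of a chi-square test are illustrative assumptions; the study’s actual data and statistics are reported in the Radiology paper.

# Sketch of the subgroup comparison described above. The counts
# are hypothetical, chosen only to mirror the quoted rates; they
# are not the study's data.
from scipy.stats import chi2_contingency

# Hypothetical counts: (clinically significant errors, total cases)
groups = {
    "White":    (14, 100),
    "Black":    (20, 100),
    "Hispanic": (17, 100),
}

for name, (errs, n) in groups.items():
    print(f"{name}: {errs / n:.0%} clinically significant errors")

# 2x2 contingency test comparing one subgroup against the
# reference group (errors vs. non-errors in each group).
b_err, b_n = groups["Black"]
w_err, w_n = groups["White"]
table = [[b_err, b_n - b_err], [w_err, w_n - w_err]]
chi2, p_value, _, _ = chi2_contingency(table)
print(f"Black vs. White: chi2 = {chi2:.2f}, p = {p_value:.3f}")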
Yi noted that the algorithm consistently demonstrated excellent performance in the overall population. But once it was applied to specific racial or other demographic subgroups, the results showed that even cutting-edge algorithms can encode bias against minority or underrepresented populations.
“The fear is that algorithms may encode biases and performance disparities between different groups,” Yi said. “As AI models are increasingly used over time in health care, this can further perpetuate preexisting health disparities. What might seem to be a fairly small degree of bias in one dataset could, on a larger scale, make a real impact in health care.”
In many cases, the root cause of these disparities may not be the AI itself; rather, the disparities may relate to when and how the gold standards for bone age, or other health indicators, were established. For pediatric bone age, the gold standard is the Greulich and Pyle (GP) atlas, which is based on hand and wrist X-rays taken in the 1930s and 1940s from about 1,000 predominantly White pediatric patients in Ohio.
Beyond the underrepresentation of other races and demographic groups, there are further factors to consider: nutrition and lifestyle, for example, have likely changed significantly since the GP standards were established.
“In recent years especially, there has been a lot of discussion in medicine about how and what we define as ‘normal,’” Yi said, noting that, for decades, standard medical practice often involved adjusted calculations for certain lab values based on a person’s race. “What the medical community has found is that a lot of these calculations were based on biased and flawed studies that can be traced back to decades ago. Now, there’s growing acknowledgement that the medical community needs to reexamine a lot of what we once accepted as dogma.”
But Yi and his team aren’t opposed to AI applications in health care; in fact, they are excited about AI’s potential impact. One reason is that, when implemented properly, AI could enable clinicians to measure certain health care benchmarks, like pediatric bone health, more objectively.
“Although AI models have the potential for bias, they also have the benefit of being consistent in the predictions they provide – they don’t have variability in their predictions the same way that humans do, like how much sleep did you get, or how tired are you?” Yi said. “Because AI models don’t get tired, don’t sleep, etcetera, there’s a lot of potential for AI to positively impact medicine.
“But in order to reach this potential, we have to make sure that things like bias are not present in these models,” he continued. “The first step is awareness of the problem, which will then set the foundations for building solutions to mitigate bias. This is something we are working on next in our lab.”
Yi served as corresponding author of the Radiology article titled “Generalizability and Bias in a Deep Learning Pediatric Bone Age Prediction Model Using Hand Radiographs.” First author Elham Beheshtian and co-authors Kristin Putman, Samantha M. Santomartino, and Vishwa S. Parekh also contributed to this research.
Published November 2, 2022