Four commercially available natural language processing (NLP) tools for chest x-ray report annotation show high overall accuracy but exhibit significant age-related bias, according to a study published October 22 in Radiology.
The models (CheXpert, RadReportAnnotator, ChatGPT-4, and cTAKES) were between 82.9% and 94.3% accurate at labeling x-ray reports for thoracic diseases, but they performed markedly worse in patients over 80 years old, noted lead author Samantha Santomartino, a medical student at Drexel University in Philadelphia, and colleagues.
“While NLP tools can facilitate [deep learning] development in radiology, they must be vetted for demographic biases prior to widespread deployment to prevent biased labels from being perpetuated at scale,” the group wrote.
NLP is a set of automated techniques for analyzing written text, and commercial models that employ the technology may offer an alternative for curating large imaging datasets for deep-learning AI development, the authors explained. However, without robust evaluation for bias, NLP and the AI tools developed from it may perpetuate existing healthcare inequities related to socioeconomic factors, they wrote.
In this study, the researchers tested the four NLP tools on a subset of the Medical Information Mart for Intensive Care (MIMIC) chest x-ray dataset (balanced for representation of age, sex, and race and ethnicity; n = 692) and the entire Indiana University (IU) chest x-ray dataset (n = 3,665).
Three board-certified radiologists annotated the reports for 14 thoracic disease labels to establish ground truth. NLP tool performance was evaluated using several metrics, including accuracy and error rate, while bias was evaluated by comparing performance between demographic subgroups.
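In practice, a subgroup bias check of this kind amounts to stratifying a tool's error rate by demographic group and looking for gaps. Below is a minimal sketch of that idea in Python; the field names, age buckets, and single-label records are illustrative assumptions, not the study's actual code or data schema.

```python
from collections import defaultdict

def subgroup_error_rates(records):
    """Compute an NLP tool's error rate within each demographic subgroup.

    Each record is a dict with illustrative (assumed) keys:
      'age_group'   - demographic bucket, e.g. '41-60' or '>80'
      'radiologist' - ground-truth label from radiologist annotation
      'nlp_label'   - label assigned by the NLP tool
    """
    errors, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        group = rec["age_group"]
        totals[group] += 1
        if rec["nlp_label"] != rec["radiologist"]:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

# Toy example: the tool disagrees with radiologists more often for the oldest group.
records = [
    {"age_group": "41-60", "radiologist": "pneumonia", "nlp_label": "pneumonia"},
    {"age_group": "41-60", "radiologist": "no finding", "nlp_label": "no finding"},
    {"age_group": ">80", "radiologist": "edema", "nlp_label": "no finding"},
    {"age_group": ">80", "radiologist": "effusion", "nlp_label": "effusion"},
]
print(subgroup_error_rates(records))  # {'41-60': 0.0, '>80': 0.5}
```

A gap like the one above, persisting across tools and datasets, is the kind of bias signal the researchers report.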
ChatGPT-4 and CheXpert achieved accuracies of 94.3% and 92.6%, respectively, on the IU dataset, while RadReportAnnotator and ChatGPT-4 led in accuracy on the MIMIC dataset at 92.2% and 91.6%, respectively, according to the findings.
However, all four tools exhibited demographic biases across age groups in both datasets, with the highest error rates (mean, 15.8% ± 5 [SD]) in patients older than 80 years.
“Because NLP forms the foundation for imaging dataset annotations, biases in these tools may explain biases observed in deep-learning models for chest radiographic imaging diagnosis,” the researchers wrote.
Ultimately, algorithmic biases can be mitigated by making training data more diverse and representative of the population, and NLP tools should be trained on contemporary data so that they reflect current demographic trends, the researchers wrote.
“Debiasing algorithms during training through techniques such as fairness awareness and bias auditing may help mitigate biases,” they suggested.
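One common form of the "fairness awareness" the quote alludes to is reweighting training examples so underrepresented groups carry more influence on the loss. The study does not specify an implementation; the following is a minimal sketch of one such scheme, with illustrative names throughout.

```python
from collections import Counter

def inverse_frequency_weights(groups):
    """Weight each training example inversely to its demographic group's
    frequency, so every group contributes equally in aggregate - one
    simple fairness-aware training scheme (an assumption, not the
    study's method)."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

# Toy example: patients over 80 make up only 20% of the training set.
groups = ["0-80"] * 80 + [">80"] * 20
weights = inverse_frequency_weights(groups)
print(weights[0], weights[-1])  # 0.625 for the majority, 2.5 for the minority
```

Bias auditing, the other technique named, is essentially the subgroup error-rate comparison sketched earlier, run routinely during development rather than after deployment.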