Large language models (LLMs) show promise for analyzing radiology reports of bone fractures imaged on CT -- and thus helping radiologists to more quickly classify the injuries, according to a study published July 7 in the Journal of Imaging Informatics in Medicine.
The study findings suggest that LLMs, "when carefully validated, could reliably assist in the initial classification of fractures based on textual CT reports," wrote a team led by medical student Markus Mergen of the Technical University of Munich in Germany.
"While human oversight remains crucial, these models have the potential to streamline radiological workflows, especially for common fracture types," Mergen and colleagues noted.
Orthopedic injuries require precise classification to guide treatment, the group explained. This is typically done with the AO (Arbeitsgemeinschaft für Osteosynthesefragen) classification system, which characterizes fractures by bone, segment, and type, grouping fracture types into three categories: Type A, "extra-articular/simple fractures that don't involve the joint surface"; Type B, "articular/partial fractures that involve part of the joint surface"; and Type C, "articular/complete fractures that completely 'disrupt' the joint surface."
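To make the scheme concrete, here is a minimal illustrative sketch (not taken from the study) of how the three AO fracture-type categories might be represented as a simple data structure; the enum name and code are assumptions for illustration only.

```python
from enum import Enum

class AOFractureType(Enum):
    """Illustrative encoding of the three AO fracture-type categories (not the study's code)."""
    A = "extra-articular/simple fracture; does not involve the joint surface"
    B = "articular/partial fracture; involves part of the joint surface"
    C = "articular/complete fracture; completely disrupts the joint surface"

# A full AO code also records the bone and segment (proximal, diaphyseal, or distal),
# which is why the study tracked bone and bone-part recognition separately from
# fracture-type accuracy.
print(AOFractureType.B.value)
```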
Yet the AO system is subject to interobserver variability -- which is why AI could help with this particular clinical task, the investigators wrote. They observed that previous research suggests image-based deep learning models detect fractures with accuracies between 69% and 88%, but that "the potential of LLMs to interpret textual radiology reports for AO classification remains unexplored -- a critical gap given the ubiquity of narrative-style CT/MRI reports in clinical workflows."
The study authors evaluated the performance of four different LLMs for classifying fractures based on the AO system from CT bone fracture exam reports. The LLMs included ChatGPT-4o, AmbossGPT, Claude 3.5 Sonnet, and Gemini 2.0 Flash. They used a dataset of 292 artificial, physician-generated CT reports that represented 310 fractures. Mergen and colleagues also created a real-life validation cohort that included 145 fractures from 141 radiology reports, analyzing these with the LLM LLaMA 3.3-70B.
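The article does not reproduce the authors' prompts or settings, but a minimal sketch of how such a text-based classification step could be wired up is shown below; the prompt wording, model name, and use of the OpenAI Python client are assumptions for illustration, not the study's actual method.

```python
# Minimal sketch: asking an LLM to assign an AO fracture type from a CT report.
# The prompt text and model choice are illustrative assumptions, not the study's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are assisting with AO fracture classification. Based on the CT report below, "
    "answer with exactly one letter: A (extra-articular/simple), "
    "B (partial articular), or C (complete articular).\n\nReport:\n{report}"
)

def classify_report(report_text: str) -> str:
    """Return the model's single-letter AO fracture-type answer for one report."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example call with a fictitious report snippet:
# print(classify_report("Comminuted intra-articular fracture of the distal radius ..."))
```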
The fractures in the fictitious reports were classified into the proximal, diaphyseal, and distal regions of the bone (with the majority, 52.7%, in the proximal region) and spanned eight anatomical regions, distributed as follows:
- Femur (22.3%)
- Forearm (30.6%)
- Spine (7.7%)
- Pelvis (4.2%)
- Hand (4.2%)
- Humerus (17.7%)
- Lower leg (8.7%)
- Scapula or clavicle (4.5%)
According to the AO framework, Type A fractures accounted for 36.1% of the total artificial reports, Type B for 29.4%, and Type C for 35.5%.
The team found that ChatGPT-4o and AmbossGPT showed the highest overall accuracy.
LLM performance for analyzing fracture types on fictitious radiology reports

| Measure | AmbossGPT | ChatGPT-4o | Claude 3.5 Sonnet | Gemini 2.0 Flash |
|---|---|---|---|---|
| Accuracy | 74.3% | 74.6% | 69.5% | 62.7% |
The group also found the following:
- There were statistically significant differences among the LLMs in the accuracy of fracture type classification. For example, AmbossGPT showed 100% accuracy for spinal fractures, outperforming the other LLMs, while Gemini 2.0 Flash underperformed in hand fractures (7.7%), and ChatGPT-4o showed strong performance in femur fractures (92.9%).
- All models had strong bone recognition rates (90% to 99%), but accuracy in fracture subtype classification was lower (71% to 77%), "indicating limitations in nuanced diagnostic categorization," the team wrote.
Finally, the team compared LLaMA 3.3-70B's performance on the fictitious and real-world datasets, finding the two comparable.
LLaMA 3.3-70B performance: fictitious vs. real-world datasets

| Measure | Fictitious dataset | Real-world dataset |
|---|---|---|
| Overall performance | 69.7% | 69.9% |
| Bone recognition | 94.5% | 98.6% |
| Bone part recognition | 91.7% | 93.8% |
| Fracture type recognition | 73.8% | 72.6% |
The study findings are promising, but more research is needed, according to Mergen's group.
"Future studies should validate these findings on large, multi-center datasets of authentic clinical reports … [and] prospective studies are needed to assess how LLM-assisted workflows impact diagnostic speed, surgeon-radiologist communication, and patient outcomes in emergency settings," it concluded.