OpenAI's large language model (LLM) o3 scored 90% on Japan's national licensing exam for radiologic technologists and drafted a mock exam that experts judged up to par.
The performance was good enough to consider the model for generating mock exams for training programs, noted the study’s lead author Toshimune Ito, PhD, of Teikyo University in Tokyo, and colleagues.
“OpenAI o3 can generate radiological licensure items that align with national standards in terms of difficulty, factual correctness, and blueprint coverage,” the group wrote. The study was published November 13 in JMIR Medical Education.
Most mock exam items for training programs are written by instructors who draw on past examinations or personal clinical experience, which can result in biases in content coverage, inconsistencies in wording, and variable educational usefulness, according to the authors.
Given that LLMs have achieved high accuracy on radiology exams in previous studies, the researchers hypothesized that they could be useful for educating radiologic technologists.
First, they tested four LLMs -- OpenAI o3, o4-mini, o4-mini-high (OpenAI), and Gemini 2.5 Flash (Google) -- on all 200 multiple-choice items on the 2025 Japanese National Examination for Radiological Technologists. The models' outputs were compared to the official answer key, with correct and incorrect responses counted overall for the 200 items and separately for the 173 items that did not require image interpretation.
OpenAI o3 performed the best, with a score of 90% overall and a score of 92% on nonimage items, significantly outperforming o4-mini on the full set (p = 0.02). Across models, accuracy differences on the nonimage subset were not significant, the researchers reported.
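The article does not spell out how accuracy was tallied or which paired test produced the p value; the sketch below shows one plausible approach in Python, assuming per-item answer strings and McNemar's test for the head-to-head comparison. All variable names are illustrative.

```python
# Minimal sketch: score two models against the official answer key and compare
# them with McNemar's test for paired binary outcomes. The study's exact
# statistical method is not stated in this article; this is an assumption.
from statsmodels.stats.contingency_tables import mcnemar

def score(model_answers, answer_key):
    """Return a list of 1/0 flags, one per item, marking correct responses."""
    return [int(a == k) for a, k in zip(model_answers, answer_key)]

def compare_models(flags_a, flags_b):
    """Build the 2x2 discordance table and run an exact McNemar test."""
    both    = sum(1 for a, b in zip(flags_a, flags_b) if a == 1 and b == 1)
    a_only  = sum(1 for a, b in zip(flags_a, flags_b) if a == 1 and b == 0)
    b_only  = sum(1 for a, b in zip(flags_a, flags_b) if a == 0 and b == 1)
    neither = sum(1 for a, b in zip(flags_a, flags_b) if a == 0 and b == 0)
    table = [[both, a_only], [b_only, neither]]
    return mcnemar(table, exact=True).pvalue

# Hypothetical usage with per-item answer strings:
# o3_flags   = score(o3_answers, official_key)
# mini_flags = score(o4_mini_answers, official_key)
# accuracy_o3 = sum(o3_flags) / len(o3_flags)      # e.g., 0.90 overall
# p_value = compare_models(o3_flags, mini_flags)   # paired significance test
```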
Next, the group fed OpenAI o3 the official exam specifications along with files containing the previous five years of exam items, then prompted the model to generate 192 original items across 14 subjects. Image-based items were excluded, since all models performed poorly on those, the researchers noted.
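The study's actual prompts, file handling, and model settings are not described in this article; a minimal sketch of blueprint-driven item generation might look like the following, assuming the OpenAI Python client and passing a blueprint excerpt as plain text. Every string and parameter below is illustrative.

```python
# Minimal sketch of prompting a model to draft exam items from a blueprint.
# The prompt wording, model identifier, and workflow are assumptions, not the
# study's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_items(subject: str, n_items: int, blueprint_excerpt: str) -> str:
    """Ask the model to draft multiple-choice items for one exam subject."""
    prompt = (
        f"Using the following examination blueprint, write {n_items} original "
        f"multiple-choice questions for the subject '{subject}'. Each question "
        "needs five options, one correct answer, and a brief rationale. "
        "Do not require image interpretation.\n\n"
        f"Blueprint excerpt:\n{blueprint_excerpt}"
    )
    response = client.chat.completions.create(
        model="o3",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# e.g., items = generate_items("Radiation physics", 14, blueprint_text)
```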
Radiology faculty experts then rated each item on five criteria (item difficulty, factual accuracy, accuracy of content coverage, appropriateness of wording, and instructional usefulness), using a 5-point scale: (1) unacceptable, (2) major revision needed, (3) revisable, (4) minor revision, and (5) adoptable.
The model received high expert ratings for item difficulty (mean, 4.3), factual accuracy (4.2), and content coverage (4.7), although ratings were lower for appropriateness of wording (3.9) and instructional usefulness (3.6).
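Aggregating such ratings into the per-criterion means reported above is straightforward; a short sketch using pandas is shown below, with assumed column names and made-up example scores.

```python
# Minimal sketch: compute mean expert ratings per criterion (e.g., difficulty,
# factual accuracy). Column names and the example scores are illustrative only.
import pandas as pd

# Each row: one expert's 1-5 rating of one generated item on one criterion.
ratings = pd.DataFrame({
    "item_id":   [1, 1, 2, 2],
    "criterion": ["difficulty", "factual_accuracy", "difficulty", "factual_accuracy"],
    "score":     [4, 5, 5, 4],
})

# Mean rating per criterion across all items and raters.
criterion_means = ratings.groupby("criterion")["score"].mean().round(1)
print(criterion_means)
```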
“Although the AI-generated questions fell short in terms of wording clarity and pedagogical feedback, these can be mitigated through targeted editorial review,” the group wrote. “Practically speaking, LLMs can be used to draft content that is eventually refined by the faculty.”
Future advancements in high-resolution visual encoders and medical-specific tuning will be required to close the performance gaps on the image-based items, the researchers noted. Also, adaptive feedback functions and automated blueprint mapping could further extend their educational value.
“After overcoming these barriers in terms of technical improvements and reproducibility safeguards, LLMs can be a strong asset in radiological technology education, which can even extend to the licensure preparations of other allied health professionals worldwide,” the researchers concluded.