ChatGPT might be able to pass the current benchmark exam for qualification as a radiologist in the U.K., according to a group of musculoskeletal radiologists in Birmingham.
The group evaluated ChatGPT’s performance on a two-part test based on a bank of questions resembling the Fellowship of the Royal College of Radiologists (FRCR) exam. The large language model narrowly failed the part one true/false questions yet clearly passed the part two single best answer questions, noted lead author Sisith Ariyaratne, MD, of the Royal Orthopedic Hospital.
“GPT-4 performed at a reasonably high standard on the questions posed to it, which were of a similar standard to the text-based FRCR examination questions, demonstrating remarkable capability in sections requiring factual recall and understanding,” the group wrote, in an article published December 28 in Academic Radiology.
ChatGPT is an AI language tool developed by OpenAI that uses machine learning algorithms to comprehend and generate original text resembling human language. Prior studies have suggested that ChatGPT may be able to perform at a high standard on the Canadian Royal College and American Board of Radiology exams, the authors noted.
They also noted that ChatGPT has demonstrated an impressive ability to generate research articles resembling authentic human-written articles, although it has been shown that such articles can be factually inaccurate.
To further investigate the performance of ChatGPT, the researchers put the chatbot to the test on the FRCR exam, the current benchmark for qualification as a radiologist in the U.K. It is a rigorous exam designed to assess the knowledge and understanding of candidates across various facets of clinical radiology, the authors explained.
The Royal College of Radiologists (RCR) publishes only a limited number of sample questions, the authors noted. Thus, after discussions with candidates who had recently passed the exam, the group assembled a set of questions from question banks that closely resembled the FRCR examination, they wrote.
Part one of the exam consisted of 203 five-part true/false physics questions, while part two consisted of 240 single best answer questions on a broad range of core curriculum that simulated the true length of the FRCR’s 2A exam. Questions were answered by both GPT-3.5 and GPT-4 models.
According to the findings, GPT-4 answered 74.8% of the part one true/false statements correctly, just short of the 2023 passing mark of 75.5% for part one of the FRCR exam.
“ChatGPT thus narrowly failed,” the group wrote.
On part two, GPT-3.5 answered 50.8% of the single best answer (SBA) questions correctly, while GPT-4 answered 74.2% correctly. The winter 2022 passing mark for part two of the FRCR exam was 63.3%, and thus GPT-4 clearly passed, the group added.
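For readers who want to see the margins side by side, the short Python sketch below compares each reported score with the passing mark quoted for that part of the exam. The percentages are the ones reported above; the `Result` class, its field names, and the simple rule that a score at or above the passing mark counts as a pass are illustrative assumptions, not the study's own scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One reported score and the passing mark it is compared against."""
    part: str
    model: str
    score_pct: float      # percentage answered correctly, as reported
    pass_mark_pct: float  # passing mark quoted for that part of the exam

    @property
    def margin(self) -> float:
        # Percentage points above (+) or below (-) the passing mark.
        return round(self.score_pct - self.pass_mark_pct, 1)

    @property
    def passed(self) -> bool:
        # Assumption: meeting or exceeding the passing mark is a pass.
        return self.score_pct >= self.pass_mark_pct


RESULTS = [
    Result("part one (true/false)", "GPT-4", 74.8, 75.5),
    Result("part two (single best answer)", "GPT-3.5", 50.8, 63.3),
    Result("part two (single best answer)", "GPT-4", 74.2, 63.3),
]

for r in RESULTS:
    outcome = "pass" if r.passed else "fail"
    print(f"{r.model}, {r.part}: {r.score_pct}% vs {r.pass_mark_pct}% "
          f"({r.margin:+.1f} points, {outcome})")
```

Under these assumptions, GPT-4 falls 0.7 percentage points short on part one and clears the part two mark by 10.9 points, while GPT-3.5 misses the part two mark by 12.5 points.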
Ultimately, the study highlights some of the advanced capabilities of AI language models such as ChatGPT and how these abilities may play a role in medical education and health care in the future, the researchers wrote.
“It is reasonable to assume that further developments in AI will be more likely to succeed in comprehending and solving questions related to medicine, specifically clinical radiology,” the group concluded.