A commercially available AI search engine boosted the performance of the latest ChatGPT model and provided another “leap forward” for the technology when tested on radiology board-style questions, according to a group at the University of Toronto.
A team led by Rajesh Bhayana, MD, tested Perplexity Pro paired with GPT-4 Turbo on 150 multiple-choice, text-based questions that matched the style, content, and difficulty of the Canadian Royal College and the American Board of Radiology examinations. The combined system answered 90% of the questions correctly.
“Our findings illustrate the powerful potential of optimized [retrieval-augmented generation] systems in radiology,” the group wrote. The study was published October 8 in Radiology.
In previous studies, GPT-4 has performed well on radiology board-style exams, notwithstanding some illogical and inaccurate assertions, or hallucinations, according to the authors. Since then, GPT-4 Turbo has been released, as has Perplexity Pro, which features retrieval-augmented generation (RAG) technology, the authors explained.
RAG is an optimization technique that grounds the responses of large language models (LLMs) such as GPT-4 Turbo in additional high-quality information. Used together, Perplexity retrieves relevant information from the web while GPT-4 Turbo generates responses grounded in that material.
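For readers curious about the mechanics, a RAG pipeline boils down to two steps: retrieve source passages relevant to the question, then place them in the model's prompt so the answer is drawn from them rather than from the model's memory alone. The sketch below is illustrative only; the toy keyword retriever and prompt are invented for this article, and Perplexity's actual web retrieval is proprietary.

```python
# Minimal retrieval-augmented generation (RAG) sketch (illustrative only;
# this is not the pipeline used in the study).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stand-in corpus; a real system would search the web or a document index.
CORPUS = [
    "Pneumothorax on an upright radiograph shows a visceral pleural line.",
    "MRI is the preferred modality for characterizing soft-tissue tumors.",
    "CT pulmonary angiography is the first-line test for suspected PE.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    # Toy relevance score: number of words shared with the question.
    q = set(question.lower().split())
    ranked = sorted(CORPUS, key=lambda p: -len(q & set(p.lower().split())))
    return ranked[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    # Grounding step: retrieved passages go into the prompt so the model
    # answers from them rather than from parametric memory alone.
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "Answer using the sources provided."},
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```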
Using RAG to reduce LLM hallucinations in radiology could further enable impactful applications, such as a radiology copilot that accurately answers radiologists' questions during reporting, the group hypothesized.
To test these capabilities, the researchers compared the performance of Perplexity Pro with GPT-4 Turbo against that of GPT-4 alone on the same 150 multiple-choice, text-based radiology questions. Performance was assessed overall, by question type, and by topic, with differences compared using the McNemar test.
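The McNemar test is the standard choice when two classifiers answer the same items: it looks only at the discordant pairs, the questions one model got right and the other got wrong. A minimal sketch using statsmodels is shown below; the cell counts are hypothetical, chosen only to be consistent with the overall scores reported in the study, since the per-question results are not public.

```python
# McNemar test on paired right/wrong outcomes over the same questions.
# The cell counts below are hypothetical, not the study's actual data.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired outcomes over 150 questions:
# rows = Perplexity with GPT-4 Turbo (correct, incorrect),
# cols = GPT-4 (correct, incorrect).
table = [[112, 23],   # both correct; only Perplexity correct
         [  6,  9]]   # only GPT-4 correct; both incorrect

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```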
According to the findings, Perplexity with GPT-4 Turbo answered 90% (135 of 150) of the questions correctly, substantially outperforming GPT-4, which scored 79% (118 of 150). Further analysis showed that Perplexity with GPT-4 Turbo answered 92% (56 of 61) of lower-order questions and 89% (79 of 89) of higher-order questions, while GPT-4 scored 79% on both subsets.
“Perplexity’s optimized web-based RAG enabled another leap forward in performance on a radiology board-style examination without images,” the group wrote.
While Perplexity is not radiology-specific, it prioritizes authoritative sources from the web and uses LLMs to optimize retrieval from these sources while forming responses. Radiology-specific systems enriched with the highest quality radiology resources could further reduce LLM hallucinations and improve performance for radiology use cases, the authors wrote.
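One hypothetical way to implement that kind of source prioritization is to up-weight search results from a whitelist of trusted radiology domains before any passage reaches the model. The domain list, weights, and URLs in the sketch below are invented for illustration; they do not describe how Perplexity actually ranks sources.

```python
# Hypothetical source prioritization for a radiology-specific RAG system.
# Domains, weights, and URLs are invented for illustration only.
from urllib.parse import urlparse

TRUSTED = {"radiopaedia.org": 2.0, "pubs.rsna.org": 2.0, "acr.org": 1.5}

def rerank(results: list[dict]) -> list[dict]:
    """Boost search results from authoritative radiology domains.

    Each result is {'url': ..., 'score': ...} from a generic web search.
    """
    def boosted(r: dict) -> float:
        host = urlparse(r["url"]).netloc.removeprefix("www.")
        return r["score"] * TRUSTED.get(host, 1.0)
    return sorted(results, key=boosted, reverse=True)

hits = [
    {"url": "https://someblog.example/pe-imaging", "score": 0.9},
    {"url": "https://pubs.rsna.org/doi/10.1148/xyz", "score": 0.8},
]
print(rerank(hits)[0]["url"])  # the journal source now ranks first
```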
However, board-style examination performance does not directly translate to clinical utility, they noted.
“Potential further improvements should be explored with high-quality radiology-specific data and for multimodal applications,” the group concluded.