SAN FRANCISCO - A deep-learning artificial intelligence (AI) algorithm can automatically summarize the key findings of radiology reports for x-ray images nearly as well as radiologists can, according to a Monday presentation at the Conference on Machine Intelligence in Medical Imaging (C-MIMI).
Researchers from Stanford University, led by Dr. Curtis Langlotz, developed a long short-term memory (LSTM) neural network that uses natural language processing to produce radiology impressions, or summaries of radiology reports. They found that their deep-learning algorithm outperformed similar baseline models overall and performed at least as well as radiologists in 67% of the cases examined.
"Our deep learning-based sequence-to-sequence model to automatically summarize radiology report findings showed high lexical overlap with human-written summaries and good clinical validity," presenter Yuhao Zhang told meeting attendees.
Automated impressions
Previous research has shown that more than 50% of physicians examine only the impression section of radiology reports to inform their diagnosis, which underscores the importance of this work, Zhang noted. Yet generating radiology impressions is often time-consuming for radiologists and subject to human error. For instance, the repetitive nature of the task could lead radiologists to leave out relevant details unintentionally.
Seeking to improve the efficiency of preparing radiology impressions, Zhang and colleagues developed a deep-learning algorithm capable of producing impression statements by examining radiology reports of x-ray images. It relies on an LSTM network to generate statements (the output) based on previously dictated, free-text radiology findings (the input).
This process involves the following series of automated steps (a code sketch follows the list):
- Mapping the words used in the radiology findings to a vector representation -- a pattern that represents the semantic meaning of a word in relation to surrounding words
- Encoding the sequence of word vectors into a single fixed-size vector
- Decoding that vector, i.e., using the vocabulary to predict the sequence of words that best summarizes the radiology findings
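For readers who want a concrete picture of the encode-decode pipeline above, the sketch below shows a minimal LSTM encoder-decoder in PyTorch. It is an illustration only: the class name, layer sizes, and vocabulary size are assumptions, not the Stanford group's actual architecture.

```python
# A minimal sketch of an LSTM encoder-decoder for findings-to-impression
# summarization. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class FindingsSummarizer(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=100, hidden_dim=200):
        super().__init__()
        # Step 1: map word indices to dense vectors (word embeddings)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Step 2: encode the findings sequence into hidden states
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Step 3: decode an impression, one word at a time
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, findings_ids, impression_ids):
        # Encode the free-text findings (the input)
        enc_out, (h, c) = self.encoder(self.embedding(findings_ids))
        # Decode conditioned on the encoder's final state; during training
        # the decoder sees the gold impression shifted by one (teacher forcing)
        dec_out, _ = self.decoder(self.embedding(impression_ids), (h, c))
        # Predict a distribution over the vocabulary at each output position
        return self.out(dec_out)

model = FindingsSummarizer()
findings = torch.randint(0, 30000, (1, 120))   # a tokenized findings section
impression = torch.randint(0, 30000, (1, 25))  # the target impression
logits = model(findings, impression)           # shape: (1, 25, vocab_size)
```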
The deep-learning algorithm also includes a copy mechanism feature that enables it to imitate the vocabulary used by radiologists in the radiology findings, rather than produce the impression from scratch.
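The presentation did not give implementation details, but copy mechanisms of this kind are typically built as a "pointer-generator" gate that blends generating a new word with copying one from the source text. The following sketch, with assumed variable names and shapes, shows the general idea:

```python
# A rough sketch of a pointer-generator-style copy mechanism, the general
# technique the article describes. Variable names and shapes are assumptions.
import torch

def mix_copy_and_generate(p_vocab, attention, source_ids, p_gen):
    """Blend the decoder's generated distribution with a 'copy' distribution
    that points back at words in the source findings.

    p_vocab:    (batch, vocab_size) softmax over the output vocabulary
    attention:  (batch, src_len) attention weights over source positions
    source_ids: (batch, src_len) vocabulary indices of the source words
    p_gen:      (batch, 1) learned gate in [0, 1]; 1 = generate, 0 = copy
    """
    # Scatter attention mass onto the vocabulary slots of the source words,
    # so the model can reuse the radiologist's own wording verbatim
    p_copy = torch.zeros_like(p_vocab).scatter_add_(1, source_ids, attention)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```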
To train and test their model, the researchers obtained 82,127 radiography reports that radiologists completed at Stanford Hospital between 2000 and 2014. They first pretrained their model using a learning algorithm developed by Pennington et al (Global Vectors for Word Representation [GloVe], Stanford University); then they used 70% of the radiography reports for training their algorithm, 10% for validation, and 20% for testing.
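As a back-of-the-envelope illustration, the 70/10/20 split works out as below; the placeholder tuples stand in for the real findings/impression pairs, which are not publicly reproduced here.

```python
# A simple sketch of the 70/10/20 train/validation/test split described
# above, using placeholder data in place of the 82,127 real report pairs.
import random

reports = [(f"findings {i}", f"impression {i}") for i in range(82127)]
random.seed(0)       # fix the shuffle so the split is reproducible
random.shuffle(reports)

n = len(reports)
train = reports[:int(0.70 * n)]                # 70% for training
val   = reports[int(0.70 * n):int(0.80 * n)]   # 10% for validation
test  = reports[int(0.80 * n):]                # 20% for testing
print(len(train), len(val), len(test))         # 57488 8213 16426
```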
Improved radiology reports
Overall, their deep-learning model outperformed two baseline models on the Recall-Oriented Understudy for Gisting Evaluation (ROUGE), a set of metrics used to evaluate machine-generated summaries; a brief scoring example follows the table below.
Comparison of summarization models for generating radiology impressions

| Metric | Latent semantic analysis model | LexRank model | Stanford deep-learning model |
| --- | --- | --- | --- |
| ROUGE-1 | 29.4 | 30.5 | 46.5 |
| ROUGE-2 | 16.3 | 17.1 | 33.4 |
| ROUGE-L | 27.4 | 28.5 | 45.0 |
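To make the table concrete: ROUGE-1 and ROUGE-2 measure unigram and bigram overlap between a generated summary and a reference, while ROUGE-L measures longest-common-subsequence overlap. The snippet below scores a toy pair with Google's open-source rouge-score package; the presentation did not specify which ROUGE implementation the researchers used.

```python
# Scoring a generated impression against a reference with ROUGE-1/2/L,
# using the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

reference = "no acute cardiopulmonary abnormality."
generated = "no acute cardiopulmonary process."
scores = scorer.score(reference, generated)
for name, score in scores.items():
    print(name, round(score.fmeasure, 3))
```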
For a qualitative assessment, a board-certified radiologist compared 100 radiology impressions written by humans with the corresponding ones produced by the algorithm, without knowing which was which. The radiologist judged 51 of the impression pairs to be equivalent in quality, preferred the human-written impression in 33 cases, and preferred the AI-generated one in 16 cases.
Most of the automated radiology impressions were grammatically correct, included "better" summaries for brief or negative studies, and were at least as good as human impressions for 67% of the reports, Zhang said.
The Stanford algorithm's main limitations were its inability to assess the grammaticality of its own output and its occasional omission of salient findings from the impression. The algorithm also had difficulty writing appropriate impressions for unusually long findings sections.
Recently, Zhang and colleagues published a paper through Stanford University that details updates to their algorithm, including a feature that allows it to take into account additional information available with x-ray images -- such as clinical indications -- to improve the quality of its impressions.
Looking ahead, the investigators plan to develop new ways for their algorithm to recognize important findings, as well as to improve its clinical usability by training it to select and edit templates.