Disparate VC results suggest 2D/3D debate isn't over

Two recent studies have given new life to an old rivalry: 2D versus 3D primary reading of virtual colonoscopy (VC or CT colonography [CTC]) data.

The development comes as something of a surprise in a debate that had pretty much ended in a draw a couple of years ago, when researchers came to agree that both methods have their strengths and weaknesses. Each primary reading method encounters lesions that are better seen with the other, which is why a colorectal polyp found with one method is supposed to be confirmed with the other before it is reported.

Contributing to the peace are workstation developers who have incorporated both 3D endoluminal views and 2D multiplanar views into their products, making both methods accessible at the click of a mouse.

Finally, studies have shown that both methods can yield roughly equivalent sensitivity for detecting colorectal polyps, and these days radiologists are encouraged to use whichever primary reading method works best for them. Nevertheless, strong reader preferences have remained, and if the recent 2D/3D détente has succeeded in papering over disagreements over which method is best, it has certainly not eliminated them.

So the results from two new studies -- both produced from controlled trials at multiple screening centers, examining similar cohorts of asymptomatic patients with a similar prevalence of adenomatous lesions -- are intriguing. Do the discordant findings point to the superiority of a single software package, a particularly talented group of readers, or some other factor? Inquiring radiologists want to know, but they may have to wait.

The studies

In the recent ACRIN 6664 trial led by Dr. C. Daniel Johnson from the Mayo Clinic in Rochester, MN, 2D versus 3D reading on multiple systems produced no statistically significant difference in lesion sensitivity. In fact, further analysis by the group's lead statistician showed a trend in favor of 2D.

However, a new study in the American Journal of Roentgenology by Dr. Perry Pickhardt and colleagues from the University of Wisconsin in Madison found that primary 3D reading on a single system produced substantially better results than primary 2D reading performed by 10 experienced radiologists.

ACRIN analysis shows comparable sensitivity

At the 2007 RSNA meeting in Chicago, statistician Alicia Toledano, Sc.D., from Brown University in Providence, RI, presented the results of her group's analysis of the data from the National CT Colonography Trial (ACRIN 6664), which recruited 2,600 asymptomatic participants at 15 centers across the U.S.

The data had been acquired using a thin-section low-dose acquisition, following automated CO2 insufflation (ProtoCO2l, E-Z-EM, Lake Success, NY). The results, presented but not yet published in a peer-reviewed medical journal, show a per-patient sensitivity of 93% for adenomas 1 cm and larger for virtual colonoscopy.

Patients were randomly assigned to a primary 2D or 3D read in the original trial, using several software platforms, which paved the way for further analysis.

Toledano's team, which compiled the statistics for the original study, did the follow-up analysis to evaluate the results through the lens of 2D versus 3D primary reading method -- to see if there were any meaningful differences based on which method was used.

"What we did was to randomly assign CTC exams to be read at the institute using either conventional (2D) display with 3D endoscopy for problem solving, or a primary 3D search method including the capability of displaying  MPR (multiplanar reformatted) 2D images," Toledano said in her RSNA presentation. "The images were then redistributed during the trial for independent review using the other method. So each patient's exam was reviewed in 2D by one radiologist and in 3D by another radiologist." The radiologists were blinded to the colonoscopy results.

The results showed that sensitivities for primary 2D and primary 3D review within local reads were similar, Toledano said. Both ranged from 50% to 100%, with averages of 91% ± 4% for 2D and 88% ± 4% for 3D.

"The radiologists performed very well when reviewing their own cases," she said. "Eleven of 15 had perfect reading sensitivity."

Another finding of the study was more surprising: Accuracy was reduced when radiologists reviewed data from other centers. Reread sensitivities ranged from 50% to 100%, averaging 91% ± 4%. Within the rereads, sensitivity for primary 2D review ranged from 0% (based on a single reviewed case) to 100%, averaging 87% ± 7%; for primary 3D review, the range was 38% to 100%, averaging 92% ± 5%.

"Eight of 15 radiologists had perfect test sensitivity looking at someone else's cases in 2D, and seven of 14 looking at someone else's cases in 3D, but the estimated sensitivity (82% in 2D and 79% in 3D) more often than not was less than it was when someone was reading his or her own cases," Toledano said.

The similarity in average sensitivities between local reads and rereads of someone else's cases supported pooling of the results, she said. Overall, sensitivities for primary 2D review ranged from 50% to 100%, with an average of 88% ± 4%, the same as for primary 3D review. Seven radiologists had higher sensitivity for primary 2D review, four for primary 3D review, and four had equal sensitivity for both. Overall sensitivity was 82% for primary 2D review and 79% for primary 3D review.

Using confidence intervals and standard errors that allow for variation across radiologists, the group also quantified the uncertainty that arises because the patients are a sample from a larger population and the radiologists are a sample from a larger population, "and that's important because we know from past experience that this variation exists," Toledano said.

Results for specificity also supported pooling, which yielded an average of 86% ± 1% for primary 2D review (range: 72% to 92%) and an average of 85% ± 2% for primary 3D review (range: 61% to 92%).

"If I look across all four measures -- sensitivity, specificity, positive predictive value, and negative predictive value -- I do see some differences in favor of 2D, including 7% sensitivity advantage, 3% specificity advantage, 5% advantage in positive predictive value, and similar negative predictive values," she said. "But none of these differences is statistically significant, and that's shown by the confidence intervals."

Whether read in 2D or 3D, virtual colonoscopy has a high sensitivity for detecting subjects with colorectal neoplasms 10 mm in diameter or larger, Toledano concluded. "It's a win-win situation," she said.

"Both primary 3D and primary 2D reading methods differ in sensitivity in favor of primary 2D, but because we have a limited number of patients in this prospective screening study (n = 2,531 cases completed) and a limited number of patients with large neoplasms (n = 109 ≥ 10 mm), and because performance varies across radiologists, the standard error is 8%, the p value > 0.05, and that is not statistically significant," Toledano said.

AJR study logs better results for 3D

A key difference between the U.S. Department of Defense (DoD) CTC trial, which averaged 93% sensitivity for detecting lesions 10 mm and larger, and later multicenter VC trials (Cotton et al, Rockey et al) that produced far lower detection sensitivities was the primary 3D reading method used in the DoD trial, Pickhardt and colleagues wrote in their December 2007 AJR paper. However, they noted, there is no direct evidence supporting the superiority of either reading method.

"To more directly test the relative importance of 2D versus 3D displays in polyp detection, we had experienced CTC reviewers evaluate a large consecutive subset of cases from the DoD CTC screening trial using a primary 2D approach," the group wrote. "The 2D results were compared with the published results from primary 3D evaluation performed by less experienced reviewers and with the published results from the three primary 2D CTC trials" (AJR, December 2007, Vol. 189:4, pp. 774-779).

In the study, 10 radiologists who were blinded to the results retrospectively interpreted 730 consecutive colonoscopy-proven CTC cases in asymptomatic adults using a primary 2D approach, with 3D used only for problem solving.

The group compared 2D performance on updated software (V3D Colon version 2.0, Viatronix, Stony Brook, NY) with the primary 3D detection results from the original DoD trial of 1,233 asymptomatic adults, which was published in 2003 and used an earlier version of the software (New England Journal of Medicine, December 4, 2003, Vol. 349:23, pp. 2191-2200).

Using primary 2D MPR views for the retrospective reads, the 10 readers completed their 2D cases in a mean time of 6.7 minutes, compared to 19.6 minutes for the original 3D reads performed by less experienced readers. The 2D readers had each completed more than 100 cases before starting the study. Still, Pickhardt told AuntMinnie.com, they were free to take as much time as they needed for each case.

"The 2D reads were under no time pressure, unlike the prospective 3D reads (in the original DoD trial), where patients went on to immediate optical colonoscopy -- making it a very pressured reading," Pickhardt wrote in an e-mail. "The 2D reads were completed during academic or free time, and not while on a busy clinical service. These days in the clinic, 3D reads average well under 10 minutes, he said (ACRIN trial reading times were quite long: a mean of 25.5 minutes for 3D versus 19.4 minutes for primary 2D).

Pickhardt's results showed that primary 2D sensitivity for adenomas 6 mm and larger was 44.1% (56/127), compared with 85.7% (180/210) at 3D VC (p < 0.001). For adenomas 10 mm and larger, sensitivity of 2D VC was 75.0% (27/36), compared with 92.2% (47/51) at 3D reading (p = 0.027).

The authors reported similar sensitivity trends in a per-patient analysis and for all polyps at both the 6- and 10-mm thresholds. Per-patient specificity for 2D evaluation at the 10-mm threshold was 98.1% (676/689), compared with 97.4% (1,131/1,161) at 3D evaluation (p = 0.336).
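As a rough check on those figures, the sketch below runs a simple two-proportion z-test on the 10-mm adenoma comparison (27/36 detected at primary 2D versus 47/51 at primary 3D); it reproduces a p value of about 0.027, matching the published result. The paper's exact statistical method is not specified here, so this should be read only as a plausibility check.

```python
# Illustrative two-proportion z-test for the 10 mm adenoma comparison
# (27/36 detected at primary 2D vs. 47/51 at primary 3D). The AJR paper's
# exact statistical method is not given in this article; this is only a check.
from math import sqrt
from scipy.stats import norm

x1, n1 = 27, 36  # adenomas >= 10 mm detected with primary 2D
x2, n2 = 47, 51  # adenomas >= 10 mm detected with primary 3D

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)                        # pooled proportion under H0
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # SE of the difference
z = (p2 - p1) / se
p_value = 2 * (1 - norm.cdf(abs(z)))
print(f"2D = {p1:.1%}, 3D = {p2:.1%}, p = {p_value:.3f}")  # p is approximately 0.027
```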

The retrospective nature of the study was an important limitation, the authors noted, while citing a number of other factors that served to strengthen the study results.

"The size of the screening population evaluated makes this the largest primary 2D CTC study to date," they wrote. "The strict inclusion and exclusion criteria ensure a true screening population; the low prevalence of disease provides a more rigorous and relevant evaluation compared with a polyp-rich cohort. The use of same-day optical colonoscopy with segmental unblinding of CTC results provides an enhanced reference standard. The primary 2D reviewers were significantly more experienced in CTC interpretation than the original 3D reviewers, which should have favored the 2D results. Finally, the results from the prospective 3D interpretation allow a more direct comparison of the two interpretation techniques."

Primary 2D VC is less sensitive than primary 3D for polyp detection in low-prevalence screening cohorts, the group concluded. "The disappointing 2D sensitivity in this study was very similar to results obtained with primary 2D evaluation in previous CTC trials."

Why the difference?

How could two studies of such similar cohorts have come to such different conclusions? Was it the readers, the different 3D reading software, or some other factor that made the difference? In the DoD trial results that constituted the 3D arm of their AJR study, Pickhardt and colleagues used an early version of V3D Colon, while the National CT Colonography Trial relied on several vendors for primary 3D interpretation.

According to Toledano, the reviews for the ACRIN trial were performed on five different platforms: half of the cases used software from Vital Images (Minnetonka, MN), more than a quarter used GE Healthcare (Chalfont St. Giles, U.K.), 14% used Siemens Medical Solutions (Malvern, PA), 8% used Viatronix, and 1% used TeraRecon (San Mateo, CA). The choice of software platforms was left to each individual research site, and reflected the diversity that can be expected in larger clinical practice, she noted.

Pickhardt said he wasn't surprised that 3D didn't perform as well in the ACRIN trial.

"It's simple to explain really," he wrote in an e-mail to AuntMinnie.com. "If you lower the 3D performance (by using substandard 3D software) and somewhat artificially elevate the 2D performance (by only reading one to two cases in a sitting and taking nearly 20 minutes each), then the difference between 2D and 3D will be less than what we found in our study. My understanding is that the 3D performance of the few ACRIN sites using Viatronix was substantially higher than the other sites, which also supports this notion. The fact that the average 3D reading time in ACRIN was so long also suggests that the software used was not up to par."

Toledano told Pickhardt after her talk that 3D results would have been better if more sites had used the Viatronix package. Still, she said, the contributions of any particular software package are far from settled.

Further study is planned to try to tease out a difference between software manufacturers, Toledano said in her talk. In addition to further analyzing the difference between local reads and rereads, "we're also going to be looking at what happens from software manufacturer to manufacturer -- trying to see if there's some magic of the Viatronix (software)," Toledano said. And the group will do its best to separate software effects from other factors that may be influencing the results, she noted.

Dr. Abraham Dachman from the University of Chicago in Illinois asked if varying skills of different reader groups might have made a statistical difference in the results. After all, he said, the original DoD study relied on just four readers, who could have constituted an unusually skilled group compared to the 15 readers in the recent ACRIN data analysis. Would the results have been different if different readers had examined the results of both trials?

"When we made our confidence intervals, we did try to account for the fact that our radiologists are a sample of radiologists from a larger population," Toledano said. For this reason, similar results could be expected if a new group of readers were plugged into the ACRIN analysis, she said, cautioning that her group had not analyzed the Pickhardt data.

By Eric Barnes
AuntMinnie.com staff writer
December 11, 2007

Related Reading

ACRIN trial shows VC ready for widespread use, September 28, 2007

Study: Primary 3D VC equivalent to colonoscopy, September 12, 2007

Debate over 2D versus 3D VC reveals subtle differences, January 1, 2005

Group credits 3D reading for best-ever VC results, October 15, 2003

Virtual colonoscopy: 2D vs. 3D primary read, June 3, 2002

