ARLINGTON, VA - Study end points -- such as sensitivity, specificity, and the area under the curve -- are crucial for gauging the effectiveness of an imaging modality being tested for application in a clinical trial. Equally important is what goes into those end points, according to Jeffrey Blume, Ph.D., from Brown University's Center for Statistical Sciences in Providence, RI.
Blume discussed protocol design from the perspective of diagnostic end points and the things that can foul them up, degrading the validity of radiologists' hard-earned trial results. Sources of variability and bias should be weeded out at the trial design stage, he said.
Blume, who said he prepared his talk with assistance from the statistical center's director Constantin Gatsonis, Ph.D., spoke on Thursday at the American College of Radiology Imaging Network (ACRIN) 2007 fall meeting.
The dominant paradigm for the clinical trial is based on therapeutic treatments -- e.g., how well a new drug improves disease outcomes. But diagnostic imaging trials, which occupy an intermediate step in the medical continuum -- the assessment of disease -- are quite different and in some ways better, he said.
Unlike in therapeutic trials, patient outcomes in diagnostic trials, which include most imaging studies, are also affected by the downstream course of treatment, Blume said. On the other hand, while a randomized controlled design is considered the holy grail of therapeutic trials, diagnostic imaging trials can actually be more instructive when patients aren't randomized to one imaging modality or another.
When MRI is compared to CT for imaging a specific disease, for example, both exams can be performed in every patient, which eliminates patient-to-patient variability and improves the validity of the results over the classic randomized design. But there's more to it than just scanning everyone and publishing the results.
"We need to figure out what we want to measure, and it's not always necessarily the patient outcome, because we may just want to find out if we can see a disease in order to measure it," Blume said.
Important end points for diagnostic imaging trials include accuracy (diagnostic performance, predictive value) and diagnostic decision-making based on imaging results. Researchers want to know the sensitivity, the specificity, and the area under the receiver operating characteristic (ROC) curve, which summarizes the trade-off between the two across decision thresholds. They will also look at ROC curves at the population, institution, or individual reader level, as well as summary ROC curves.
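Blume did not show code, but a minimal sketch under assumed data can make the relationship among these end points concrete. The example below uses invented 1-to-5 reader scores and truth labels (hypothetical, not from any ACRIN study) to compute sensitivity and specificity at each threshold and the area under the resulting empirical ROC curve.

```python
# Sketch: sensitivity, specificity, and ROC AUC from ordinal reader scores.
# The scores and disease labels are invented for illustration only.

# (reader score 1-5, truth: 1 = disease present, 0 = absent)
cases = [(5, 1), (4, 1), (4, 0), (3, 1), (3, 0), (2, 0), (2, 1), (1, 0), (1, 0), (5, 0)]

def sens_spec(cases, threshold):
    """Call a case positive when its score >= threshold."""
    tp = sum(1 for s, d in cases if d == 1 and s >= threshold)
    fn = sum(1 for s, d in cases if d == 1 and s < threshold)
    tn = sum(1 for s, d in cases if d == 0 and s < threshold)
    fp = sum(1 for s, d in cases if d == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# One operating point on the ROC curve per threshold.
points = []
for t in (5, 4, 3, 2, 1):
    se, sp = sens_spec(cases, t)
    points.append((1 - sp, se))          # (false-positive rate, sensitivity)
    print(f"threshold >= {t}: sensitivity={se:.2f}, specificity={sp:.2f}")

# Empirical AUC via the trapezoidal rule over the ROC points.
points = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(f"area under the ROC curve ~= {auc:.2f}")
```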
Conversely, in screening trials the accuracy of imaging tends to be less important than other end points such as positive and negative predictive values (i.e., given a positive or negative test result, how likely is the patient to have, or not have, the disease?), survival (i.e., if an imaging finding leads to treatment, how long will the patient survive?), and time to recurrence (which is affected, among other things, by the choice of treatment).
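To see why predictive values behave so differently from sensitivity and specificity in a screening setting, here is a brief, hedged calculation using Bayes' rule; the prevalence, sensitivity, and specificity figures are illustrative assumptions, not results from any trial.

```python
# Sketch: positive and negative predictive value from assumed prevalence,
# sensitivity, and specificity (numbers are illustrative, not from ACRIN data).

prevalence = 0.005   # 0.5% of the screened population has the disease
sensitivity = 0.90   # P(test positive | disease present)
specificity = 0.90   # P(test negative | disease absent)

p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_pos                  # P(disease | positive test)
npv = specificity * (1 - prevalence) / (1 - p_pos)      # P(no disease | negative test)

print(f"PPV = {ppv:.3f}")    # ~0.043: at low prevalence, most positives are false alarms
print(f"NPV = {npv:.4f}")    # ~0.9994: a negative result is highly reassuring
```

The same test, with the same accuracy, yields very different predictive values as prevalence changes, which is why screening trials weight these end points so heavily.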
"Screening trials often bridge the gap between diagnostic and therapeutic trials," Blume said.
Banishing bias
In considering the choice of end points, look for potential sources of bias, which are best eliminated in the design stage of the study, Blume said. These sources include verification bias (workup bias), which arises when whether a case gets the verifying workup depends on its test result, or when the gold standard is less than golden. If 50% of the cases testing positive were verified compared to only 20% of the cases testing negative, there is potential for bias, he said.
Say a screening study result is verified by another test: everyone who tests positive goes to biopsy and pathology, while everyone who tests negative is simply assumed to be disease-free. This leads to bias, which can be avoided by following up everyone who tests negative as part of the standard of care. If funds are insufficient to follow up everyone who tests negative, perhaps a subset of patients can be followed up and the results extrapolated to the rest of the cohort, Blume advised.
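A small, hedged simulation can illustrate both the problem and the suggested fix. All counts and verification fractions below are invented; the correction simply reweights verified cases by the inverse of their verification fraction (in the spirit of a standard Begg-Greenes-type adjustment), which works when the followed-up subset of negatives is a random sample.

```python
# Sketch: verification (workup) bias when all test positives but only a
# fraction of test negatives get the gold standard. All counts are invented.

counts = {                 # (test result, true disease status): number of patients
    ("pos", "disease"): 90,
    ("pos", "healthy"): 60,
    ("neg", "disease"): 10,
    ("neg", "healthy"): 840,
}
verify_prob = {"pos": 1.0, "neg": 0.2}   # fraction of each test group worked up

# Expected verified counts if the followed-up negatives are sampled at random.
verified = {k: n * verify_prob[k[0]] for k, n in counts.items()}

def sens_spec(tbl):
    sens = tbl[("pos", "disease")] / (tbl[("pos", "disease")] + tbl[("neg", "disease")])
    spec = tbl[("neg", "healthy")] / (tbl[("neg", "healthy")] + tbl[("pos", "healthy")])
    return sens, spec

true_sens, true_spec = sens_spec(counts)
naive_sens, naive_spec = sens_spec(verified)

# Correction: weight each verified case by 1 / (its verification probability).
corrected = {k: v / verify_prob[k[0]] for k, v in verified.items()}
corr_sens, corr_spec = sens_spec(corrected)

print(f"true      sens={true_sens:.2f} spec={true_spec:.2f}")    # 0.90, 0.93
print(f"naive     sens={naive_sens:.2f} spec={naive_spec:.2f}")  # 0.98, 0.74 (biased)
print(f"corrected sens={corr_sens:.2f} spec={corr_spec:.2f}")    # back to 0.90, 0.93
```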
Other sources of bias include interpretation bias, uninterpretable study bias, and temporal effects bias (i.e., radiologists improving their reading skills over the course of a study).
"It turns out that humans aren't very good to study in a sense because they get better every time they do a test," Blume said. "If you're running a clinical trial, that's a bad thing because all of your statistical models assume that the process is the same for every individual. If someone gets better over time, you really shouldn't be putting all that data together."
Researchers mitigate this effect in two ways, he said: by using readers who are already expert in the particular exam, since they are unlikely to improve much over the course of the study, and by separating repeat interpretations of the same cases by weeks or months to reduce residual memory.
Differing ascertainment of missing data is another source of bias, and there's a lot of missing data in clinical trials, Blume said. Statistically, if the data is missing at random, it doesn't bias the results. If there are patterns, however, such as one center that goes to great lengths to track down patients who disappear from follow-up while the other centers let them go, then the more aggressive center's data becomes unrepresentative of the centers as a whole, introducing bias into the study results.
Excluding uninterpretable tests is another source of bias, Blume said. This practice tends to overestimate the accuracy of an imaging modality in the study results because exams that can't be read aren't counted against the test. In mammography, for example, a BI-RADS score of 0, indicating an incomplete or uninterpretable study, has contributed to the difficulty of comparing different mammography trials and assessing the efficacy of the test.
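A quick, hedged calculation with invented counts shows the direction of this bias: if uninterpretable exams in diseased patients are dropped rather than counted as non-detections, sensitivity looks better than it is.

```python
# Sketch: how excluding uninterpretable exams overstates sensitivity.
# Counts are illustrative only.

detected        = 80   # diseased patients with a positive, readable exam
missed          = 10   # diseased patients with a negative, readable exam
uninterpretable = 10   # diseased patients whose exam could not be read

sens_excluding = detected / (detected + missed)
# Intention-to-diagnose view: an unreadable exam did not find the cancer either.
sens_including = detected / (detected + missed + uninterpretable)

print(f"sensitivity, uninterpretable exams excluded: {sens_excluding:.2f}")  # 0.89
print(f"sensitivity, counted as non-detections:      {sens_including:.2f}")  # 0.80
```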
Interpretation bias occurs when the reader knows the results of other tests for the disease. Studies should blind readers to this extra information if ethically possible, he said.
Ignoring interobserver variability can also introduce bias. Two readers with equal overall accuracy may operate at different thresholds along the same ROC curve, or may each have a different ROC curve altogether, he said. Length-time bias and lead-time bias can skew screening results by making patients appear to live longer with a disease, or to have it detected at an earlier stage, than would have occurred without screening.
Biomarkers are being used increasingly in trials for a number of purposes, including predicting risk (e.g., BRCA mutations) and identifying disease in both symptomatic and asymptomatic individuals, Blume said. Prognostic biomarkers are used to predict disease outcomes at the time of diagnosis, predictive biomarkers are used to predict the outcome of a particular treatment, and monitoring biomarkers measure response to treatment as well as detect relapses.
Vanquishing variability
Variability among radiologists, including variability in behavior, perception, and experience reading a study, adds noise to the performance evaluation of the imaging test, Blume said.
"When you do sample size projections you say, 'OK, this person's going to read with this ROC curve area at 90% or 85%,' " Blume said. "Once I define that number, the amount of statistical variability is set because it's a function of that number. And I know I need x number of patients to get the confidence levels to be narrow enough to make some conclusions."
If the expected reading level isn't what actually occurs, due, for example, to less experienced readers or to changes in reader behavior, then the added variability can render the confidence intervals so wide as to be inconclusive, he said.
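To make the connection between assumed reader performance and confidence interval width concrete, the sketch below uses the Hanley-McNeil (1982) approximation for the standard error of an ROC area. The case counts and AUC values are assumptions for illustration, not numbers from the talk.

```python
# Sketch: approximate 95% confidence interval half-width for an ROC area using
# the Hanley & McNeil (1982) variance formula. Inputs are illustrative assumptions.
import math

def auc_ci_halfwidth(auc, n_diseased, n_healthy):
    """Approximate 95% CI half-width for an estimated ROC area."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_diseased - 1) * (q1 - auc**2)
           + (n_healthy - 1) * (q2 - auc**2)) / (n_diseased * n_healthy)
    return 1.96 * math.sqrt(var)

# Designed assuming readers operate at AUC = 0.90 ...
print(f"AUC 0.90, 50 diseased / 50 healthy:   +/- {auc_ci_halfwidth(0.90, 50, 50):.3f}")
# ... but if readers actually perform at AUC = 0.80, the interval widens,
# and more cases are needed to get it back to the planned precision.
print(f"AUC 0.80, 50 diseased / 50 healthy:   +/- {auc_ci_halfwidth(0.80, 50, 50):.3f}")
print(f"AUC 0.80, 100 diseased / 100 healthy: +/- {auc_ci_halfwidth(0.80, 100, 100):.3f}")
```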
On a malignancy scale of 1 to 5, one reader might rate positives as anything from 3 to 5, while another rates positives as only 4 or 5. These kinds of issues, once again, have confounded the interpretation and comparison of mammography studies, Blume said. Averaging sensitivities and specificities can be misleading, he noted.
Of course, variability can also be dealt with by studying and incorporating it into the results, he said.
Another potential solution is to stop talking about sensitivity and specificity and focus on the ROC curve, Blume said. For example, reader A with 10% specificity and 90% sensitivity -- and reader B with 90% specificity and 10% sensitivity -- might be operating at different points on the same ROC curve.
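A simple binormal ROC model, entirely hypothetical, makes the point: two readers who share the same underlying curve but use different cutoffs can report very different sensitivity/specificity pairs, and averaging those pairs yields a point that sits below the curve both readers actually operate on.

```python
# Sketch: two readers on the same binormal ROC curve at different thresholds.
# The separation of the score distributions is an assumed value.
from statistics import NormalDist

norm = NormalDist()
separation = 1.5   # mean score shift for diseased vs. healthy patients (assumed)

def operating_point(threshold):
    """Sensitivity and specificity at a given decision threshold."""
    sens = 1 - norm.cdf(threshold - separation)   # diseased scores ~ N(separation, 1)
    spec = norm.cdf(threshold)                    # healthy scores  ~ N(0, 1)
    return sens, spec

sens_a, spec_a = operating_point(0.2)   # lenient reader: calls more cases positive
sens_b, spec_b = operating_point(1.3)   # strict reader: calls fewer cases positive
print(f"reader A: sensitivity={sens_a:.2f}, specificity={spec_a:.2f}")
print(f"reader B: sensitivity={sens_b:.2f}, specificity={spec_b:.2f}")

# Averaging the two operating points lands below the shared ROC curve.
avg_sens, avg_spec = (sens_a + sens_b) / 2, (spec_a + spec_b) / 2
curve_sens_at_avg_spec = 1 - norm.cdf(norm.inv_cdf(avg_spec) - separation)
print(f"average point:       sensitivity={avg_sens:.2f}, specificity={avg_spec:.2f}")
print(f"curve at same spec.: sensitivity={curve_sens_at_avg_spec:.2f}")
```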
When you tell a reader who is rating potential malignancies on a scale of 1 to 5 that a score of 3 should also be included as a potential malignancy, for example, you are actually teaching that reader to operate at a different point on the ROC curve, he said.
Besides, scales are actually a better way to do things, according to Blume. "You don't want to limit yourself to positive or negative," he said. "Get some continuous measurement or ordinal measurement (such as probability of malignancy), rather than just positive or negative that really gets into where people think this (patient) falls on the continuum from negative to positive. Then you can do an ROC analysis and … combine different readers together."
In the end, it may not be the technology that needs to be changed, but rather the reader's behavior that needs adjustment, Blume said.
Generalizability
The generalizability of results is another important issue in study design, one that requires careful consideration of the reader population, case mix (spectrum of disease), and technical characteristics of the imaging process, Blume said.
There is variation in reader experience between readers and institutions, and there are differences in the forms of disease included in a study. To minimize variability in the data, study protocols should be designed to minimize interinstitutional differences in participant populations, and define a common imaging technique that is very carefully followed by all the participating institutions.
Finally, protocols should provide for uniform disease assessment across participating institutions in a study, Blume said.
"If there's a finding on imaging, everyone's got to fill out the same form, and everyone has got to have the same process of workup," he said. "All of this has to be detailed. What's obvious at your institution isn't obvious at anyone else's. Clinical trials are all about uniformity, and that's what gives them generalizability."
Blume offered a list of questions to keep in mind when designing protocols and choosing end points:
- What is the target condition and population?
- How can disease status be verified?
- Is the diagnostic test transferable?
- Is there a standard diagnostic test?
- What is the impact of varying clinical practice?
- Are results transferable to other clinical settings?
- What do you want to measure? Accuracy, efficacy, or effectiveness?
"When you have a complete protocol, you should be able to answer all of these questions specific to your investigation," Blume said.
By Eric Barnes
AuntMinnie.com staff writer
September 28, 2007
Copyright © 2007 AuntMinnie.com