Bayesian trial in the real world
This post arose from a discussion on Twitter about a recently-published randomised trial. Twitter isn’t the best forum for debate so I wanted to summarise my thoughts here in some more detail.
What was interesting about the trial was that it used a Bayesian analysis, but this provoked a lot of reaction on Twitter that seemed to miss the mark a bit. There were some features of the analysis that some people found challenging, and the Bayesian methods tended to get the blame for that, incorrectly in my view.
First, a bit about the trial. It’s this one:
Laptook et al Effect of Therapeutic Hypothermia Initiated After 6 Hours of Age on Death or Disability Among Newborns With Hypoxic-Ischemic Encephalopathy. JAMA 2017; 318(16): 1550-1560.
This trial randomised infants with hypoxic ischaemic encephalopathy who were aged over 6 hours to cooling to 33.5 C for 96 hours (to prevent brain injury) or no cooling. Earlier studies have established that cooling started in the first 6 hours after birth reduces death and disability, so it is plausible that starting later might also help, though maybe the effect would be smaller. The main outcome was death or disability at 18 months.
The methodological interest here is that they used a Bayesian final analysis, because they felt that they would only be able to recruit a restricted number of infants, and a Bayesian analysis would be more informative, as it can quantify the probability of the treatment’s benefit, rather than giving the usual significant/non-significant = works/doesn’t work dichotomy.
The main outcome occurred in 19/78 in the hypothermia group and 22/79 in the no hypothermia group. Their analysis used three different priors, a neutral prior (centred on RR 1.0), an enthusiastic prior, centred on RR 0.72 (as found in an earlier trial of hypothermia started before 6 hours), and a sceptical prior, centred on RR 1.10. The 95% interval for the neutral prior was from 0.5 to 2.0, so moderately informative.
The results for the Bayesian analysis with the neutral prior that were presented in the paper were: an adjusted risk ratio of 0.86, with 95% interval from 0.58 to 1.29, and 76% probability of the risk ratio being less than 1.
OK, that’s the background.
Here are some unattributed Twitter reactions:
“This group (mis)used Bayesian methods to turn a 150 pt trial w P=0.62 into a + result w/ post prob efficacy of 76%!”
“I think the analysis is suspicious, it moves the posterior more than the actual effect size in study, regardless which prior chosen
Primary outcome is 24.4% v 27.9% which is RR of 0.875 at best. Even with a weak neutral prior, should not come up with aRR to 0.86
Also incredibly weak priors with high variance chosen, with these assumptions, even a n=30 trial would have shifted the posterior.”
There were some replies from Bayesian statisticians, saying (basically) no, it looks fine. The responses were interesting to me, as I have frequently said that Bayesian methods would help clinicians to understand results from clinical trials more easily. Maybe that’s not true! So it’s worth digging a bit into what’s going on.
First, on the face of it 19 versus 22 patients with the outcome (that’s 24.4% versus 27.8%) doesn’t look like much of a difference. It’s the sort of difference that all of us are used to seeing described as “non-significant,” followed by a conclusion that the treatment was not effective or something like that. So to see this result meaning a probability of benefit of 76% might look as if it’s overstating the case.
Similarly, the unadjusted risk ratio was about 0. 875, but the Bayesian neutral-prior analysis had RR=0.86; so it looks as though there has been some alchemy in the Bayesian analysis to increase the effect size.
So is there a problem or not? First, the 76% probability of benefit just means 76% posterior probability (based on the prior, model and data) that the risk ratio is less than 1. There’s quite a sizeable chunk of that probability where the effect size is very small and not really much of a benefit, so it’s not 76% probability that the treatment does anything useful. The paper actually reported the probability that the absolute risk difference was >2%, which was 64%, so quite a bit lower.
Second, 76% probability of a risk ratio less than 1 also means 24% probability that it is more than 1, so there is a fairly substantial probability that the treatment isn’t beneficial at all. I guess we are more used to thinking of results in terms of “working” or “not working” and a 76% probability sounds like a high probability of effectiveness.
Third, the point estimate. The critical point here is that the results presented in the paper were adjusted estimates, using baseline measures of severity as covariates. The Bayesian analysis with neutral prior centred on 1 would in fact pull the risk ratio estimate towards 1; the reason the final estimate (0.86) shows a bigger effect than the unadjusted estimate (0.875) is the adjustment, not the Bayesian analysis. The hypothermia group was a bit more severely affected than the control group, so the unadjusted estimate is over-conservative (too near 1), and the covariate adjustment has reduced the risk ratio. So even when pulled back towards 1 by the neutral prior, it’s still lower than the unadjusted estimate.
Another Twitter comment was that the neutral prior was far too weak, and gave too much probability to unrealistic effect sizes. The commenter advocated using a much narrower prior centred on 1, but with much less spread. I don’t agree with that though, mainly because assuming such a prior would be equivalent to assuming more data in the prior than in the actual trial, which doesn’t seem sensible when it isn’t based on actual real data.
The other question about priors is what would be a reasonable expectation based on what we know already? If we believe that early starting of hypothermia gives a substantial benefit (which several trials have found, I think), then it seems totally reasonable that a later start might also be beneficial, just maybe a bit less so. The results are consistent with this interpretation – the most probable risk ratios are around 0.85.
Going further, the division into “early” or “late” starting of hypothermia (before or after 6 hours of age) is obviously artificial; there isn’t anything that magically happens at 6 hours, or any other point. Much more plausible is a decline in effectiveness with increasing time to onset of hypothermia. It would be really interesting and useful to understand that relationship, and the point at which it wasn’t worth starting hypothermia. That would be something that could be investigated with the data from this and other trials, as they all recruited infants with a range of ages (in this trial it was 6 to 24 hours). Maybe that’s an individual patient data meta-analysis project for someone.