November 01, 2017

Bayesian trial in the real world

This post arose from a discussion on Twitter about a recently-published randomised trial. Twitter isn’t the best forum for debate so I wanted to summarise my thoughts here in some more detail.

What was interesting about the trial was that it used a Bayesian analysis, but this provoked a lot of reaction on Twitter that seemed to miss the mark a bit. There were some features of the analysis that some people found challenging, and the Bayesian methods tended to get the blame for that, incorrectly in my view.

First, a bit about the trial. It’s this one:
Laptook et al. Effect of Therapeutic Hypothermia Initiated After 6 Hours of Age on Death or Disability Among Newborns With Hypoxic-Ischemic Encephalopathy. JAMA 2017; 318(16): 1550-1560.

This trial randomised infants with hypoxic ischaemic encephalopathy who were aged over 6 hours to cooling to 33.5 C for 96 hours (to prevent brain injury) or no cooling. Earlier studies have established that cooling started in the first 6 hours after birth reduces death and disability, so it is plausible that starting later might also help, though maybe the effect would be smaller. The main outcome was death or disability at 18 months.

The methodological interest here is that they used a Bayesian final analysis, because they felt that they would only be able to recruit a restricted number of infants, and a Bayesian analysis would be more informative, as it can quantify the probability of the treatment’s benefit, rather than giving the usual significant/non-significant = works/doesn’t work dichotomy.

The main outcome occurred in 19/78 in the hypothermia group and 22/79 in the no hypothermia group. Their analysis used three different priors, a neutral prior (centred on RR 1.0), an enthusiastic prior, centred on RR 0.72 (as found in an earlier trial of hypothermia started before 6 hours), and a sceptical prior, centred on RR 1.10. The 95% interval for the neutral prior was from 0.5 to 2.0, so moderately informative.

The results for the Bayesian analysis with the neutral prior that were presented in the paper were: an adjusted risk ratio of 0.86, with 95% interval from 0.58 to 1.29, and 76% probability of the risk ratio being less than 1.

OK, that’s the background.

Here are some unattributed Twitter reactions:

“This group (mis)used Bayesian methods to turn a 150 pt trial w P=0.62 into a + result w/ post prob efficacy of 76%!”

“I think the analysis is suspicious, it moves the posterior more than the actual effect size in study, regardless which prior chosen
Primary outcome is 24.4% v 27.9% which is RR of 0.875 at best. Even with a weak neutral prior, should not come up with aRR to 0.86
Also incredibly weak priors with high variance chosen, with these assumptions, even a n=30 trial would have shifted the posterior.”

There were some replies from Bayesian statisticians, saying (basically) no, it looks fine. The responses were interesting to me, as I have frequently said that Bayesian methods would help clinicians to understand results from clinical trials more easily. Maybe that’s not true! So it’s worth digging a bit into what’s going on.

First, on the face of it 19 versus 22 patients with the outcome (that’s 24.4% versus 27.8%) doesn’t look like much of a difference. It’s the sort of difference that all of us are used to seeing described as “non-significant,” followed by a conclusion that the treatment was not effective or something like that. So to see this result meaning a probability of benefit of 76% might look as if it’s overstating the case.

Similarly, the unadjusted risk ratio was about 0.875, but the Bayesian neutral-prior analysis had RR=0.86; so it looks as though there has been some alchemy in the Bayesian analysis to increase the effect size.

So is there a problem or not? First, the 76% probability of benefit just means 76% posterior probability (based on the prior, model and data) that the risk ratio is less than 1. There’s quite a sizeable chunk of that probability where the effect size is very small and not really much of a benefit, so it’s not 76% probability that the treatment does anything useful. The paper actually reported the probability that the absolute risk difference was >2%, which was 64%, so quite a bit lower.

Second, 76% probability of a risk ratio less than 1 also means 24% probability that it is more than 1, so there is a fairly substantial probability that the treatment isn’t beneficial at all. I guess we are more used to thinking of results in terms of “working” or “not working” and a 76% probability sounds like a high probability of effectiveness.
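As a rough check on these numbers, the reported posterior can be approximated by a normal distribution on the log risk ratio scale. This is only a sketch: the normal approximation, the 0.9 threshold and the function name are assumptions made for illustration, not anything taken from the paper.

```python
from math import log
from statistics import NormalDist

# Reported neutral-prior result: RR 0.86, 95% interval 0.58 to 1.29.
rr_point, rr_low, rr_high = 0.86, 0.58, 1.29

# Assumption: the posterior for log(RR) is roughly normal, so its SD can be
# recovered from the width of the 95% interval.
post_mean = log(rr_point)
post_sd = (log(rr_high) - log(rr_low)) / (2 * 1.96)
posterior = NormalDist(post_mean, post_sd)


def prob_rr_below(threshold):
    """Approximate posterior probability that the risk ratio is below threshold."""
    return posterior.cdf(log(threshold))


p_any_benefit = prob_rr_below(1.0)  # should land near the reported 76%
p_rr_below_09 = prob_rr_below(0.9)  # 0.9 is an arbitrary illustrative threshold

print(f"P(RR < 1.0) ~ {p_any_benefit:.2f}")
print(f"P(RR < 0.9) ~ {p_rr_below_09:.2f}")
print(f"P(RR > 1.0) ~ {1 - p_any_benefit:.2f}")
```

Under this approximation the probability of any benefit comes out close to the reported 76%, and a noticeable share of it belongs to risk ratios between 0.9 and 1, which is the "small effect" region described above.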

Third, the point estimate. The critical point here is that the results presented in the paper were adjusted estimates, using baseline measures of severity as covariates. The Bayesian analysis with neutral prior centred on 1 would in fact pull the risk ratio estimate towards 1; the reason the final estimate (0.86) shows a bigger effect than the unadjusted estimate (0.875) is the adjustment, not the Bayesian analysis. The hypothermia group was a bit more severely affected than the control group, so the unadjusted estimate is over-conservative (too near 1), and the covariate adjustment has reduced the risk ratio. So even when pulled back towards 1 by the neutral prior, it’s still lower than the unadjusted estimate.
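The direction of the prior's pull can be checked with a simple conjugate normal update on the unadjusted data. This is a minimal sketch, assuming a normal prior and a normal likelihood on the log risk ratio scale; it is not the model used in the paper, which was covariate-adjusted.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Unadjusted trial data: events / total in each group.
events_t, n_t = 19, 78  # hypothermia
events_c, n_c = 22, 79  # no hypothermia

# Unadjusted log risk ratio and its usual large-sample standard error.
log_rr = log((events_t / n_t) / (events_c / n_c))
se = sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)

# Neutral prior: centred on RR 1.0 with 95% interval 0.5 to 2.0,
# treated here as normal on the log scale (an assumption of this sketch).
prior_mean = 0.0
prior_sd = log(2.0) / 1.96

# Conjugate normal update: precisions add, means are precision-weighted.
prior_prec = 1 / prior_sd ** 2
data_prec = 1 / se ** 2
post_prec = prior_prec + data_prec
post_mean = (prior_prec * prior_mean + data_prec * log_rr) / post_prec
post_sd = sqrt(1 / post_prec)

unadjusted_rr = exp(log_rr)
posterior_rr = exp(post_mean)
p_benefit = NormalDist(post_mean, post_sd).cdf(0.0)  # P(log RR < 0)

print(f"Unadjusted RR: {unadjusted_rr:.3f}")
print(f"Posterior RR (neutral prior, no adjustment): {posterior_rr:.3f}")
print(f"P(RR < 1): {p_benefit:.2f}")
```

Without covariate adjustment, the neutral prior moves the estimate from about 0.875 towards 1, the opposite direction from the published 0.86. That is consistent with the adjustment, not the prior, accounting for the lower figure.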

Another Twitter comment was that the neutral prior was far too weak, and gave too much probability to unrealistic effect sizes. The commenter advocated a prior still centred on 1 but with much less spread. I don’t agree with that though, mainly because such a prior would be equivalent to assuming more data in the prior than in the actual trial, which doesn’t seem sensible when it isn’t based on any real data.
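One way to put numbers on "more data in the prior than in the trial" is to compare precisions (1/variance) on the log risk ratio scale. The sketch below assumes normal priors, and the narrow prior's interval of 0.8 to 1.25 is a hypothetical choice for illustration; the Twitter commenter's actual suggestion isn't reproduced here.

```python
from math import log


def precision_from_interval(lower, upper):
    """Precision (1/variance) of a normal prior on log(RR), given its 95% interval."""
    sd = (log(upper) - log(lower)) / (2 * 1.96)
    return 1 / sd ** 2


# Information in the unadjusted trial result (19/78 vs 22/79), using the
# usual large-sample variance of a log risk ratio.
trial_precision = 1 / (1 / 19 - 1 / 78 + 1 / 22 - 1 / 79)

# The trial's neutral prior, and a hypothetical much narrower alternative.
neutral_precision = precision_from_interval(0.5, 2.0)
narrow_precision = precision_from_interval(0.8, 1.25)  # illustrative values only

neutral_ratio = neutral_precision / trial_precision
narrow_ratio = narrow_precision / trial_precision

print(f"Neutral prior carries {neutral_ratio:.2f} times the trial's information")
print(f"Narrow prior carries {narrow_ratio:.2f} times the trial's information")
```

On these assumptions the neutral prior is worth somewhat more than half a trial of this size, while the narrow prior would outweigh the trial several times over.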

The other question about priors is what would be a reasonable expectation based on what we know already? If we believe that early starting of hypothermia gives a substantial benefit (which several trials have found, I think), then it seems totally reasonable that a later start might also be beneficial, just maybe a bit less so. The results are consistent with this interpretation – the most probable risk ratios are around 0.85.

Going further, the division into “early” or “late” starting of hypothermia (before or after 6 hours of age) is obviously artificial; there isn’t anything that magically happens at 6 hours, or any other point. Much more plausible is a decline in effectiveness with increasing time to onset of hypothermia. It would be really interesting and useful to understand that relationship, and the point at which it wasn’t worth starting hypothermia. That would be something that could be investigated with the data from this and other trials, as they all recruited infants with a range of ages (in this trial it was 6 to 24 hours). Maybe that’s an individual patient data meta-analysis project for someone.

October 18, 2017

Language, confidence intervals and beliefs

People often speak and write about values of treatment effects outside their confidence intervals as being “excluded.” For example: “the risk ratio for major morbidity was 0.98 (95% CI 0.91, 1.06), which excluded any clinically important effects.” I just made that up but you often see and hear similar statements. What understanding do people take from it? There are two possible interpretations.

First, the straightforward meaning that clinically important values are outside the confidence interval. This is using “exclude” just as the opposite of “include” to make a statement about what is and isn’t inside the confidence interval.

But there is another interpretation, or another layer of interpretation, which I suspect is very common, and results from the meaning of “exclude” as something a bit stronger. Dictionary definitions give things like “to keep out, reject or not consider, shut or keep out,” which have a sense that excluding something is actively rejecting it. Using that word may therefore give the impression that the values outside the confidence interval can be discounted or ruled out. That is too strong a conclusion. Those values may be less compatible with the data, but that alone doesn’t make them unlikely or implausible.

I guess this is a similar issue to the use of “significant” and “confidence”; the word brings extra connotations.

September 21, 2017

Best sample size calculation ever!

I don't want to start obsessing about sample size calculations, because most of the time they're pretty pointless and irrelevant, but I came across a great one recently.

My award for least logical sample size calculation goes to Mitesh Patel et al, Intratympanic methylprednisolone versus gentamicin in patients with unilateral Meniere's disease: a randomised, comparative effectiveness trial, in The Lancet, 2016, 388: 2753-62.

The background: Meniere's disease causes vertigo attacks and hearing loss. Gentamicin, the standard treatment, improves vertigo but can worsen hearing. So the question is whether an alternative treatment, methylprednisolone, would be better - as good in reducing vertigo, and better in terms of hearing loss. That's actually not what the trial did though - it had frequency of vertigo attacks as the primary outcome. You might question the logic here; if gentamicin is already good at reducing vertigo, you might get no or only a small improvement with methylprednisolone, but methylprednisolone might not cause as much hearing loss. So you want methylprednisolone to be better at reducing hearing loss, as long as it's nearly as good as gentamicin at reducing vertigo.

Anyway, the trial used vertigo as its primary outcome, and recruited 60 people, which was its pre-planned sample size. But when you look at the sample size justification, it's all about hearing loss! Er... that's a completely different outcome. They based the sample size of 60 people on "detecting" a difference of (i.e. getting statistical significance if the true difference was) 9dB (sd11). Unsurprisingly, the trial didn't find a difference in vertigo frequency.

This seems to be cheating. If you're going to sign up to the idea that it's meaningful to pre-plan a sample size based on a significance test, it seems important that it should have some relation to the main outcome. Just sticking in a calculation for a different outcome doesn't really seem to be playing the game. I guess it ticks the box for "including a sample size calculation" though. Hard to believe that the lack of logic escaped the reviewers here, or maybe the authors managed to convince them that what they did made sense (in which case, maybe they could get involved in negotiating Brexit?).

Here's their section on sample size, from the paper in The Lancet:



September 13, 2017

Confidence (again)

I found a paper in a clinical journal about confidence intervals. I’m not going to give the reference, but it was published in 2017, and written by a group of clinicians and methodologists, including a statistician. Its main purpose was to explain confidence intervals to clinical readers – which is undoubtedly a worthwhile aim, as there is plenty of confusion out there about what they are.

I think there is an interesting story here about what understanding people take away from these sorts of papers (of which there are quite a number), and how things that are written that are arguably OK can lead the reader to a totally wrong understanding.

Here’s the definition of confidence intervals that the authors give:

“A 95% confidence interval offers the range of values for which there is 95% certainty that the true value of the parameter lies within the confidence limits.”

That’s the sort of definition you see often, and some people don’t find problematic, but I think most readers will be misled by it.

The correct definition is that in a long series of replicates, 95% of the confidence intervals will contain the true value, so it’s kind-of OK to say that a 95% CI has a “95% probability of including the true value,” if you understand that means that “95% of the confidence intervals that you could have obtained would contain the true value.”

Where I think this definition goes wrong is in using the definite article: “THE range of values for which there is 95% certainty…” That seems to be saying pretty clearly that we can conclude that there is a 95% probability that the true value is in this specific range. I’m pretty sure that is what most people would understand, and the next logical step is that if there is 95% probability of the true value being in this range, if we replicate the study many times, we will find a value in this range 95% of the time.

That’s completely wrong – the probability that a replication’s estimate falls within a given 95% CI varies depending on exactly where the CI falls in relation to the true value. If you’ve got a CI that happens to be extreme, the probability of a replication’s estimate landing in that range might be very low. On average it’s around 83.4% (see Cumming & Maillardet 2006, ref below).
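The average figure can be checked in the simplest case, a normally distributed estimate with known standard error. The sketch below is illustrative only: the true value of 0, the standard error of 1, the number of simulated pairs and the function name are all arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

z = NormalDist()
z_crit = z.inv_cdf(0.975)  # about 1.96

# Analytic result: the difference between two independent estimates has SD
# sqrt(2) times the standard error, so the chance that a replication lands
# inside the first study's 95% CI, averaged over all first studies, is:
capture_analytic = 2 * z.cdf(z_crit / sqrt(2)) - 1


def simulate(n_pairs=100_000, seed=1):
    """Simulate pairs of studies with true value 0 and standard error 1."""
    rng = random.Random(seed)
    covers_truth = 0
    captures_replication = 0
    for _ in range(n_pairs):
        original = rng.gauss(0, 1)
        replication = rng.gauss(0, 1)
        lower, upper = original - z_crit, original + z_crit
        covers_truth += lower <= 0 <= upper
        captures_replication += lower <= replication <= upper
    return covers_truth / n_pairs, captures_replication / n_pairs


coverage, capture_simulated = simulate()
print(f"CIs containing the true value: {coverage:.3f}")
print(f"CIs containing the replication estimate: {capture_simulated:.3f}")
print(f"Analytic capture probability: {capture_analytic:.3f}")
```

The first proportion is the familiar 95% coverage; the second, the chance of a replication falling inside the original interval, is markedly lower, at roughly 83%.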

The problem is that “95% probability of including the true value” is a property of the population of all possible confidence intervals, and unless we are very careful about language, it’s easy to convey the erroneous meaning that the “95% probability” applies to the one specific confidence interval that we have found. But in frequentist statistics it doesn’t make sense to talk about the probability of a parameter taking certain values; the parameter is fixed but unknown, so it is either in a particular confidence interval or it isn’t. That’s why the definition is as it is: 95% of the possible confidence intervals will include the true value. But we don’t know where along their length the true value will fall, or even whether it is in or out of any particular interval. It’s easy to see that “95% probability of the location of the true value” (which seems to be the interpretation in this paper) can’t be right; replications of the study will each have different data and different confidence intervals. These cannot all show the location of the true value with 95% certainty; some of them won’t even overlap!

What the authors seem to be doing, without realising it, is using a Bayesian interpretation. This is no surprise; people do it all the time, because it is a natural and intuitive thing to do, and many probably go through an entire career without realising that this is what they are doing. When we don’t know a parameter, it is natural to think of our uncertainty in terms of probability – it makes sense to us to talk about the most probable values, or a range of values with 95% probability. I think this is what people are doing when they talk about 95% probability of the true value being in a confidence interval. They are imagining a probability distribution for the parameter, with the confidence interval covering 95% of it. But frequentist confidence intervals aren’t probability distributions. They are just intervals.

I guess this post ought to have some nice illustrations. I might add some when I’ve got a bit of time.

Cumming G, Maillardet R. Confidence intervals and replication: where will the next mean fall? Psychological Methods 2006; 11(3): 217-227.

August 19, 2017

Trial results infographics

There is a fashion for producing eye-catching infographics of trial results. This is a good thing in some ways, because it’s important to get the results communicated to doctors and patients in a way they can understand. Here’s one from the recent WOMAN trial (evaluating tranexamic acid for postpartum haemorrhage).

[Image: WOMAN trial infographic]

What’s wrong with this? To my mind the main problem is that if you reduce the messages to a few headlines then you end up leaving out a lot of pretty important information. One obvious thing missing from these results is uncertainty. We don’t know, based on the trial’s results, that the number of women bleeding to death would be reduced by 30% – that’s just the point estimate, and there’s substantial uncertainty about this.

Actually the reduction by 30% isn’t the trial’s main result, which has the risk ratio for death due to haemorrhage as 0·81, 95% CI 0·65–1·00. So that’s actually a point estimate reduction of 19%, with a range of effects “consistent with the data” (or not significantly different from the data) of a reduction between 35% and zero. The 30% reduction seems to come from a subgroup analysis of women treated within 3 hours of delivery. A bit naughty to use a subgroup analysis as your headline result, but this highlights another problem with the infographic – you don’t really know what you’re looking at. In this case they have chosen to present a result that the investigators presumably feel represents the real treatment effect – but others might have different views, and there isn’t any way of knowing that you’re seeing results that have been selected to support a particular story.

[I’m guessing that the justification for presenting the “<3 hour” subgroup is that there wasn’t a clear effect in the “>3 hour” subgroup (RR 1.07, 95% CI 0.76, 1.51), so the belief is that treatment needs to be given within 3 hours to be effective. There could well be an effect of time from delivery, but it needs a better analysis than this.]

WOMAN trial: Lancet, Volume 389, No. 10084, p2105–2116, 27 May 2017

PS And what’s with the claim at the top that the drug could save 1/3 of the women who would otherwise die from bleeding after childbirth? That’s not the same as 30%, which wasn’t the trial’s result anyway. I guess a reduction of 1/3 is a possible result but so are reductions of 25% or 10%.

July 18, 2017

The future is still in the future

I just did a project with a work experience student that involved looking back through four top medical journals for the past year (NEJM, JAMA, Lancet and BMJ), looking for reports of randomised trials. As you can imagine, there were quite a lot - I'm not sure exactly how many because only a subset were eligible for the study we were doing. We found 89 eligible for our study, so there were probably at least 200 in total.

Of all those trials, I saw only ONE that used Bayesian statistical methods. The rest were still doing all the old stuff with null hypotheses and significance testing.

May 30, 2017

Hospital–free survival

One of the consequences of the perceived need for a “primary outcome” is that people try to create a single outcome variable that will include all or most of the important effects, and will increase the incidence of the outcome, or in some other way allow the sample size calculation to give you a smaller target. There has for some time been a movement to use “ventilator-free days” in critical care trials, but a recent trend is for trials of treatments for cardiac arrest to use “hospital-free survival” or “ICU-free survival,” defined as the number of days that a trial participant was alive and not in hospital or ICU, up to 30 days post randomisation.

A recent example is Nichol et al (2015), who compared continuous versus interrupted chest compressions during CPR. It was a massive trial, randomising over 23,000 participants, and found 9% survival with continuous compressions and 9.7% with interrupted. Inevitably this was described as “continuous chest compressions during CPR performed by EMS providers did not result in significantly higher rates of survival.” But it did result in a “significant” difference in hospital-free survival, which was a massive 0.2 days shorter in the continuous compression group (95% CI -0.3, -0.1, p=0.004).

A few comments. First, with a trial of this size and a continuous outcome, it’s almost impossible not to get statistical significance, even if the effect is tiny. As you can see. I very much doubt that anyone would consider an improvement in hospital-free survival of 0.2 days (that’s about 4 hours 48 minutes) of any consequence, but it’s there in the abstract as a finding of the trial.

Second, it’s a composite outcome, and like almost all composite outcomes, it includes things that are very different in their importance; in this case, whether the patient is alive or dead, and whether they are in hospital. It’s pretty hard to interpret this. What does a difference of about 5 hours in the time alive and out of hospital mean? Would a patient think that was a good reason to use the intervention? I doubt it. They would surely be more interested in the chances of surviving, and maybe secondarily whether the amount of time they might spend in hospital would be different.

Third, and this is especially true for cardiac arrest trials, the mean is a terrible way to summarise these data. The survival rate in this trial was about 9%. The vast majority of deaths would have occurred either before reaching hospital or in hospital, so all of those patients would have hospital-free survival of zero. The 9% or so of patients that survived to hospital discharge would have a number of hospital-free days between 0 and 30. So the means for each group will be pulled strongly towards zero by the huge number of participants with zero hospital-free days. The means for each group are presented in Table 3 of the paper, as 1.3 ± 5.0, and 1.5 ± 5.3, without comment, even though that seems to imply negative hospital-free survival. Definitely a case here for plotting the data to see what is going on; the tabulated summary is inadequate. The difference is almost certainly driven by the 0.7% higher survival in the interrupted compression group, which was possibly an important finding. However, because it was non-significant it is pretty much ignored and assumed to be zero.
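Both points can be illustrated with a little arithmetic. The sketch below rests on loudly stated assumptions: survivors' hospital-free days are taken to be spread evenly between 0 and 30 (purely illustrative, not the trial's data), and the two groups are taken to be equal in size at 11,500 each, which is a guess based on "over 23,000" participants.

```python
from math import sqrt
from statistics import NormalDist

# Part 1: a zero-inflated outcome. Assume 91% of participants score 0 and the
# 9% who survive are spread evenly over 0 to 30 days (purely illustrative).
p_survive = 0.09
max_days = 30.0
mean_days = p_survive * max_days / 2     # overall mean, zeros included
mean_sq = p_survive * max_days ** 2 / 3  # E[X^2]; uniform(0, 30) has E[X^2] = 300
sd_days = sqrt(mean_sq - mean_days ** 2)

print(f"Mean {mean_days:.2f} days, SD {sd_days:.2f} days")

# Part 2: a z-test on the tabulated means, assuming equal group sizes.
n_per_group = 11_500  # assumption: "over 23,000" split evenly
mean_a, sd_a = 1.3, 5.0  # continuous compressions
mean_b, sd_b = 1.5, 5.3  # interrupted compressions
se_diff = sqrt(sd_a ** 2 / n_per_group + sd_b ** 2 / n_per_group)
z_stat = (mean_a - mean_b) / se_diff
p_value = 2 * NormalDist().cdf(-abs(z_stat))

print(f"Difference {mean_a - mean_b:.1f} days, z = {z_stat:.2f}, p = {p_value:.4f}")
```

With these made-up inputs the mean and SD come out near 1.35 and 5.0, much like the tabulated 1.3 ± 5.0, with no negative values involved. A 0.2-day difference also gives a p-value well below 0.05, purely because of the sample size.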

Nichol G et al. Trial of Continuous or Interrupted Chest Compressions during CPR. NEJM 2015; 373: 2203-2214.

April 07, 2017


[Image: slide from Doug Altman’s talk]

Here’s a photo of a slide from a recent talk by Doug Altman about hypothesis tests and p-values (I nicked the picture from Twitter, additions by me). I wasn’t there so I don’t know exactly what Doug said, but I totally agree that hypothesis testing and p-values are a massive problem.

Nearly five years ago (July 2012 if I remember correctly) I stood up in front of the Warwick Medical School Division of Health Sciences, in a discussion about adopting a “Grand Challenge” for the Division, and proposed that our “Grand Challenge” should be to abandon significance testing. The overwhelming reaction was blank incomprehension. There was a vote for which of the four or five proposals to adopt, and ONE PERSON voted for my idea (for context, as a rough guess, there were probably 200-300 people in the Division at that time).

It was certainly well-known before 2012 that hypothesis tests and p-values were a real problem, but that didn’t seem to have filtered through to many medical researchers.

March 05, 2017

Sample size statement translation

Here are a couple of statements about the justification of the sample size from reports of clinical trials in high-impact journals (I think one is from JAMA and the other from NEJM):

We estimated that a sample size of 3000 … would provide 90% power to detect an absolute difference of 6.3 percentage points in the rate of [outcome] between the [intervention] group and the placebo group.

The study was planned to detect a difference of 1.1 points in the [outcome score] between the 2 interventions with a significance level of .05 and a power level of 90%.

There is nothing remarkable about these at all; they were just the first two that I came across in rummaging through my files. Statements like this are almost always found in clinical trial reports.

A translation, of the first one:

“We estimated that if we recruited 3000 participants and the true absolute difference between intervention and placebo is 6.3 percentage points, then if we assumed that there was no difference between the groups, the probability (under this assumption of no difference) of getting data that were as unusual or more unusual than those we actually obtained would be less than 0.05 in 90% of a long series of replications of the trial.”

That’s what it actually means but I guess most clinicians and researchers would find that pretty impenetrable. An awful lot is hidden by the simple word “detect” in the sample size justification statements. I suspect the language (“detect a difference”) feeds into the misunderstandings of “significant” results – it’s a real difference, not due to chance, etc.
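The quantity hiding behind “detect” can be written down in a few lines. This is a minimal sketch using a normal approximation; the function name is made up, and the 50% control-group rate and 1500 participants per group are hypothetical inputs, since the quoted statement doesn’t give the assumed rates.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treat, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided test comparing two proportions.

    Power is the long-run proportion of replicate trials that would give
    p < alpha if the true rates really were p_control and p_treat.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_treat * (1 - p_treat) / n_per_group)
    shift = abs(p_control - p_treat) / se
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical inputs: a 6.3 percentage point difference from a 50% control rate.
power_with_effect = power_two_proportions(0.50, 0.437, 1500)
power_no_effect = power_two_proportions(0.50, 0.50, 1500)

print(f"Power if the true difference is 6.3 points: {power_with_effect:.2f}")
print(f"'Power' if there is no true difference: {power_no_effect:.2f}")
```

With no true difference the function returns the significance level itself: 5% of replicate trials would still “detect” something. The figure for a true 6.3-point difference depends entirely on the assumed rates, so it shouldn’t be expected to match the 90% quoted by the trial.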

February 11, 2017

Andrew Gelman agrees with me!

Follow-up to The Fragility Index for clinical trials from Evidence-based everything

I’ve slipped in my plan to do a new blog post every week, but here’s a quick interim one.

I blogged about the fragility index a few months back. Andrew Gelman has also blogged about this, and thought much the same as I did (OK, I did ask him what he thought).

See here:
