# All 13 entries tagged *Rct*


## November 01, 2017

### Bayesian trial in the real world

This post arose from a discussion on Twitter about a recently-published randomised trial. Twitter isn’t the best forum for debate so I wanted to summarise my thoughts here in some more detail.

What was interesting about the trial was that it used a Bayesian analysis, but this provoked a lot of reaction on Twitter that seemed to miss the mark a bit. There were some features of the analysis that some people found challenging, and the Bayesian methods tended to get the blame for that, incorrectly in my view.

First, a bit about the trial. It’s this one:

**Laptook et al** *Effect of Therapeutic Hypothermia Initiated After 6 Hours of Age on Death or Disability Among Newborns With Hypoxic-Ischemic Encephalopathy. JAMA 2017; 318(16): 1550-1560.*

This trial randomised infants with hypoxic ischaemic encephalopathy who were aged over 6 hours to cooling to 33.5 C for 96 hours (to prevent brain injury) or no cooling. Earlier studies have established that cooling started in the first 6 hours after birth reduces death and disability, so it is plausible that starting later might also help, though maybe the effect would be smaller. The main outcome was death or disability at 18 months.

The methodological interest here is that they used a Bayesian final analysis, because they felt that they would only be able to recruit a restricted number of infants, and a Bayesian analysis would be more informative, as it can quantify the probability of the treatment’s benefit, rather than giving the usual significant/non-significant = works/doesn’t work dichotomy.

The main outcome occurred in 19/78 in the hypothermia group and 22/79 in the no hypothermia group. Their analysis used three different priors, a neutral prior (centred on RR 1.0), an enthusiastic prior, centred on RR 0.72 (as found in an earlier trial of hypothermia started before 6 hours), and a sceptical prior, centred on RR 1.10. The 95% interval for the neutral prior was from 0.5 to 2.0, so moderately informative.

The results for the Bayesian analysis with the neutral prior that were presented in the paper were: an adjusted risk ratio of 0.86, with 95% interval from 0.58 to 1.29, and 76% probability of the risk ratio being less than 1.

OK, that’s the background.

Here are some unattributed Twitter reactions:

“This group (mis)used Bayesian methods to turn a 150 pt trial w P=0.62 into a + result w/ post prob efficacy of 76%!”

“I think the analysis is suspicious, it moves the posterior more than the actual effect size in study, regardless which prior chosen

Primary outcome is 24.4% v 27.9% which is RR of 0.875 at best. Even with a weak neutral prior, should not come up with aRR to 0.86

Also incredibly weak priors with high variance chosen, with these assumptions, even a n=30 trial would have shifted the posterior.”

There were some replies from Bayesian statisticians, saying (basically) no, it looks fine. The responses were interesting to me, as I have frequently said that Bayesian methods would help clinicians to understand results from clinical trials more easily. Maybe that’s not true! So it’s worth digging a bit into what’s going on.

First, on the face of it 19 versus 22 patients with the outcome (that’s 24.4% versus 27.8%) doesn’t look like much of a difference. It’s the sort of difference that all of us are used to seeing described as “non-significant,” followed by a conclusion that the treatment was not effective or something like that. So to see this result meaning a probability of benefit of 76% might look as if it’s overstating the case.

Similarly, the unadjusted risk ratio was about 0.875, but the Bayesian neutral-prior analysis had RR=0.86; so it looks as though there has been some alchemy in the Bayesian analysis to increase the effect size.

So is there a problem or not? First, the 76% probability of benefit just means 76% posterior probability (based on the prior, model and data) that the risk ratio is less than 1. There’s quite a sizeable chunk of that probability where the effect size is very small and not really much of a benefit, so it’s not 76% probability that the treatment does anything useful. The paper actually reported the probability that the absolute risk difference was >2%, which was 64%, so quite a bit lower.
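To see how much of that 76% sits close to "no real difference", here is a rough sketch that treats the reported posterior (risk ratio 0.86, 95% interval 0.58 to 1.29) as normal on the log scale. That normal shape is an assumption made for illustration, not the model used in the paper, and the function names are made up here.

```python
from math import log
from statistics import NormalDist

def posterior_from_interval(rr, lower, upper):
    """Approximate the posterior of log(RR) as normal, using a reported
    point estimate and 95% interval (an assumption, not the paper's model)."""
    z = NormalDist().inv_cdf(0.975)
    sd = (log(upper) - log(lower)) / (2 * z)
    return NormalDist(mu=log(rr), sigma=sd)

def prob_rr_below(posterior, threshold):
    """Posterior probability that the risk ratio is below `threshold`."""
    return posterior.cdf(log(threshold))

post = posterior_from_interval(0.86, 0.58, 1.29)
for threshold in (1.0, 0.95, 0.9, 0.8):
    print(f"P(RR < {threshold}) = {prob_rr_below(post, threshold):.2f}")
```

Under this approximation the probability of RR < 1 comes out roughly in line with the reported 76%, and it drops steadily as the threshold moves away from 1, which is the point being made above.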

Second, 76% probability of a risk ratio less than 1 also means 24% probability that it is more than 1, so there is a fairly substantial probability that the treatment isn’t beneficial at all. I guess we are more used to thinking of results in terms of “working” or “not working” and a 76% probability sounds like a high probability of effectiveness.

Third, the point estimate. The critical point here is that the results presented in the paper were adjusted estimates, using baseline measures of severity as covariates. The Bayesian analysis with neutral prior centred on 1 would in fact pull the risk ratio estimate towards 1; the reason the final estimate (0.86) shows a bigger effect than the unadjusted estimate (0.875) is the adjustment, not the Bayesian analysis. The hypothermia group was a bit more severely affected than the control group, so the unadjusted estimate is over-conservative (too near 1), and the covariate adjustment has reduced the risk ratio. So even when pulled back towards 1 by the neutral prior, it’s still lower than the unadjusted estimate.
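The shrinkage can be checked with a back-of-envelope calculation on the unadjusted counts (19/78 versus 22/79). The sketch below assumes a normal approximation on the log risk ratio scale and a neutral prior whose 95% interval runs from 0.5 to 2.0; it is not the adjusted model from the paper, so it will not reproduce the published 0.86.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def log_rr_and_se(events_t, n_t, events_c, n_c):
    """Unadjusted log risk ratio and its usual large-sample standard error."""
    log_rr = log((events_t / n_t) / (events_c / n_c))
    se = sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    return log_rr, se

def normal_posterior(prior_mean, prior_sd, estimate, se):
    """Conjugate normal update: precision-weighted average of prior and data."""
    w_prior, w_data = 1 / prior_sd**2, 1 / se**2
    mean = (w_prior * prior_mean + w_data * estimate) / (w_prior + w_data)
    return NormalDist(mu=mean, sigma=sqrt(1 / (w_prior + w_data)))

z = NormalDist().inv_cdf(0.975)
prior_sd = log(2.0) / z  # neutral prior centred on RR 1, 95% interval 0.5 to 2.0

estimate, se = log_rr_and_se(19, 78, 22, 79)
post = normal_posterior(0.0, prior_sd, estimate, se)

print(f"unadjusted RR      = {exp(estimate):.3f}")
print(f"posterior RR       = {exp(post.mean):.3f}")
print(f"P(RR < 1 | data)   = {post.cdf(0.0):.2f}")
```

Without the covariate adjustment, the neutral prior moves the estimate from about 0.875 towards 1, not away from it.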

Another Twitter comment was that the neutral prior was far too weak, and gave too much probability to unrealistic effect sizes. The commenter advocated using a much narrower prior centred on 1, but with much less spread. I don’t agree with that though, mainly because assuming such a prior would be equivalent to assuming more data in the prior than in the actual trial, which doesn’t seem sensible when it isn’t based on actual real data.
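One way to put numbers on "more data in the prior than in the trial" is to compare precisions (1/variance) on the log risk ratio scale. In this sketch the trial's information comes from the unadjusted counts, and the "narrow" prior with a 95% interval of 0.8 to 1.25 is purely hypothetical, since the commenter's exact prior isn't given here.

```python
from math import log, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)

def prior_precision(lower, upper):
    """Precision of a normal prior on log(RR), given its 95% interval."""
    sd = (log(upper) - log(lower)) / (2 * Z95)
    return 1 / sd**2

def data_precision(events_t, n_t, events_c, n_c):
    """Precision of the unadjusted log risk ratio from the trial counts."""
    se = sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    return 1 / se**2

trial = data_precision(19, 78, 22, 79)
neutral = prior_precision(0.5, 2.0)   # the paper's neutral prior
narrow = prior_precision(0.8, 1.25)   # hypothetical narrower prior

print(f"neutral prior / trial information = {neutral / trial:.2f}")
print(f"narrow prior / trial information  = {narrow / trial:.2f}")
```

A ratio above 1 means the prior carries more weight than the trial itself, which is the situation argued against above.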

The other question about priors is what would be a reasonable expectation based on what we know already? If we believe that early starting of hypothermia gives a substantial benefit (which several trials have found, I think), then it seems totally reasonable that a later start might also be beneficial, just maybe a bit less so. The results are consistent with this interpretation – the most probable risk ratios are around 0.85.

Going further, the division into “early” or “late” starting of hypothermia (before or after 6 hours of age) is obviously artificial; there isn’t anything that magically happens at 6 hours, or any other point. Much more plausible is a decline in effectiveness with increasing time to onset of hypothermia. It would be really interesting and useful to understand that relationship, and the point at which it wasn’t worth starting hypothermia. That would be something that could be investigated with the data from this and other trials, as they all recruited infants with a range of ages (in this trial it was 6 to 24 hours). Maybe that’s an individual patient data meta-analysis project for someone.

## September 21, 2017

### Best sample size calculation ever!

I don't want to start obsessing about sample size calculations, because most of the time they're pretty pointless and irrelevant, but I came across a great one recently.

My award for least logical sample size calculation goes to Mitesh Patel et al, Intratympanic methylprednisolone versus gentamicin in patients with unilateral Meniere's disease: a randomised, comparative effectiveness trial, in The Lancet, 2016, 388: 2753-62.

The background: Meniere's disease causes vertigo attacks and hearing loss. Gentamicin, the standard treatment, improves vertigo but can worsen hearing. So the question is whether an alternative treatment, methylprednisolone, would be better - as good in reducing vertigo, and better in terms of hearing loss. That's actually not what the trial did though - it had frequency of vertigo attacks as the primary outcome. You might question the logic here; if gentamicin is already good at reducing vertigo, you might get no or only a small improvement with methylprednisolone, but methylprednisolone might not cause as much hearing loss. So you want methylprednisolone to be better at reducing hearing loss, as long as it's nearly as good as gentamicin at reducing vertigo.

Anyway, the trial used vertigo as its primary outcome, and recruited 60 people, which was its pre-planned sample size. But when you look at the sample size justification, it's all about hearing loss! Er... that's a completely different outcome. They based the sample size of 60 people on "detecting" a difference of (i.e. getting statistical significance if the true difference was) 9dB (sd11). Unsurprisingly, the trial didn't find a difference in vertigo frequency.
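For reference, this is the kind of calculation that a difference of 9dB with sd 11 feeds into. The sketch uses the standard normal-approximation formula for comparing two means; the power values (80% and 90%) and the two-sided 5% significance level are assumptions, as is the absence of any drop-out allowance, so it is not a reconstruction of the paper's own calculation.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_means(delta, sd, power, alpha=0.05):
    """Participants per group to 'detect' a mean difference `delta`
    (two-sided test, equal group sizes, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sd / delta) ** 2)

for power in (0.8, 0.9):  # assumed power levels
    n = n_per_group_means(delta=9, sd=11, power=power)
    print(f"power {power:.0%}: {n} per group, {2 * n} in total")
```

Whatever the exact inputs were, every quantity in this calculation is in decibels of hearing, so it says nothing about how many people are needed to compare vertigo frequency.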

This seems to be cheating. If you're going to sign up to the idea that it's meaningful to pre-plan a sample size based on a significance test, it seems important that it should have some relation to the main outcome. Just sticking in a calculation for a different outcome doesn't really seem to be playing the game. I guess it ticks the box for "including a sample size calculation" though. Hard to believe that the lack of logic escaped the reviewers here, or maybe the authors managed to convince them that what they did made sense (in which case, maybe they could get involved in negotiating Brexit?).

Here's their section on sample size, from the paper in The Lancet:

## August 19, 2017

### Trial results infographics

There is a fashion for producing eye-catching infographics of trial results. This is a good thing in some ways, because it’s important to get the results communicated to doctors and patients in a way they can understand. Here’s one from the recent WOMAN trial (evaluating tranexamic acid for postpartum haemorrhage).

What’s wrong with this? To my mind the main problem is that if you reduce the messages to a few headlines then you end up leaving out a lot of pretty important information. One obvious thing missing from these results is uncertainty. We don’t know, based on the trial’s results, that the number of women bleeding to death would be reduced by 30% – that’s just the point estimate, and there’s substantial uncertainty about this.

Actually the reduction by 30% isn’t the trial’s main result, which has the risk ratio for death due to haemorrhage as 0·81, 95% CI 0·65–1·00. So that’s actually a point estimate reduction of 19%, with a range of effects “consistent with the data” (or not significantly different from the data) of a reduction between 35% and zero. The 30% reduction seems to come from a subgroup analysis of women treated within 3 hours of delivery. A bit naughty to use a subgroup analysis as your headline result, but this highlights another problem with the infographic – you don’t really know what you’re looking at. In this case they have chosen to present a result that the investigators presumably feel represents the real treatment effect – but others might have different views, and there isn’t any way of knowing that you’re seeing results that have been selected to support a particular story.
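The conversion from risk ratio to percentage reduction is simple enough to check directly. A minimal sketch, using the main result quoted above (the function name is made up for illustration):

```python
def percent_reduction(risk_ratio):
    """Relative risk reduction, in percent, implied by a risk ratio."""
    return (1 - risk_ratio) * 100

estimate, lower, upper = 0.81, 0.65, 1.00  # death due to haemorrhage, main result

print(f"point estimate: {percent_reduction(estimate):.0f}% reduction")
print(f"95% CI: {percent_reduction(upper):.0f}% to {percent_reduction(lower):.0f}% reduction")
```

Note that the lower confidence limit for the risk ratio gives the largest reduction, and the upper limit gives the smallest.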

[I’m guessing that the justification for presenting the “<3 hour” subgroup is that there wasn’t a clear effect in the “>3 hour” subgroup (RR 1.07, 95% CI 0.76, 1.51), so the belief is that treatment needs to be given within 3 hours to be effective. There could well be an effect of time from delivery, but it needs a better analysis than this.]

WOMAN trial: Lancet, Volume 389, No. 10084, p2105–2116, 27 May 2017

PS And what’s with the claim at the top that the drug could save 1/3 of the women who would otherwise die from bleeding after childbirth? That’s not the same as 30%, which wasn’t the trial’s result anyway. I guess a reduction of 1/3 is a possible result but so are reductions of 25% or 10%.

## July 18, 2017

### The future is still in the future

I just did a project with a work experience student that involved looking back through four top medical journals for the past year (NEJM, JAMA, Lancet and BMJ), looking for reports of randomised trials. As you can imagine, there were quite a lot - I'm not sure exactly how many because only a subset were eligible for the study we were doing. We found 89 eligible for our study, so there were probably at least 200 in total.

Of all those trials, I saw only ONE that used Bayesian statistical methods. The rest were still doing all the old stuff with null hypotheses and significance testing.

## February 11, 2017

### Andrew Gelman agrees with me!

Follow-up to The Fragility Index for clinical trials from Evidence-based everything

I’ve slipped in my plan to do a new blog post every week, but here’s a quick interim one.

I blogged about the fragility index a few months back (http://blogs.warwick.ac.uk/simongates/entry/the_fragility_index/). Andrew Gelman has also blogged about this, and thought much the same as I did (OK, I did ask him what he thought).

See here: http://andrewgelman.com/2017/01/03/statistics-meets-bureaucracy/

## November 03, 2016

### Statistical significance and decision–making

One of the defences of the use of traditional “null hypothesis significance testing” (NHST) in clinical trials is that, at some point, it is necessary to make a decision about whether a treatment should be used, and “statistical significance” gives us a way of doing that. I hear versions of this argument on a regular basis.

But the argument has always seemed to me to be ridiculous. Even if significance tests could tell you that the null hypothesis was wrong (they can’t), that doesn’t give you any basis for a sensible decision. A null hypothesis being wrong doesn’t tell you whether the treatment has a big enough effect to be worth implementing, and it takes no account of other important things, like cost-effectiveness, safety, feasibility or patient acceptability. Not a good basis for what are potentially life and death decisions.

But don’t listen to me: listen to The American Statistical Association. Their Statement on Statistical Significance and P-Values from earlier this year addresses exactly this point. The third of their principles is:

**“Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.”**

Pretty unambiguous, I think.

## October 12, 2016

### “Something is rotten in the state of Denmark”

The DANISH trial (in which, pleasingly, the D stands for “Danish”, and it was conducted in Denmark too), evaluated the use of Implantable Cardioverter Defibrillators (ICD) in patients with heart failure that was not due to ischaemic heart disease. The idea of the intervention is that it can automatically restart the heart in the event of a sudden cardiac arrest – so it might help these patients, who are at increased risk of their heart stopping suddenly (obviously there is a lot more clinical detail to this).

The trial recruited 1116 patients and found that the primary outcome (death from any cause) occurred in 120/556 (21.6%) in the ICD group and 131/560 (23.4%) in control; a hazard ratio of 0.87, 95% CI 0.68, 1.12. The conclusion was (from the abstract):

“prophylactic ICD implantation … was not associated with a significantly lower long-term rate of death from any cause than was usual clinical care”;

and from the end of the paper:

“prophylactic ICD implantation … was not found to reduce longterm mortality.”

Note, in passing, the subtle change from “no significant difference” in the abstract, which at least has a chance of being interpreted as a statement about statistics, to “not found to reduce mortality” – a statement about the clinical effects. Of course the result doesn’t mean that, but the error is so common as to be completely invisible.

Reporting of the trial mostly put it across as showing no survival improvement, for example:

https://healthmanagement.org/c/cardio/news/danish-trial-icds-in-non-ischaemic-heart-failure

http://www.medscape.com/viewarticle/868065

http://www.tctmd.com/show.aspx?id=136105

The main issue in this trial, however, was that the ICD intervention DID reduce sudden cardiac death, which is what the intervention is supposed to do: 24/556 (4.3%) in the ICD group and 46/560 (8.2%) in control, hazard ratio 0.50 (0.31, 0.82). All cardiovascular deaths (sudden and non-sudden) were also reduced in the ICD group, but not by so much: HR 0.77 (0.57, 1.05). You might expect a result like this if the ICD reduced sudden cardiac deaths, but in addition to this both groups have similar risk of non-sudden cardiac death. When all deaths are counted (including cardiac and other causes), the difference in the outcome that the intervention can affect starts getting swamped by outcomes that it doesn’t reduce. The sudden cardiac deaths make up a small proportion of the total, so the overall difference between the groups is dominated by deaths that weren’t likely to differ between the groups, and the difference in all-cause mortality is much smaller (and “non-significant”). So all of the results seem consistent with the intervention reducing the thing it is intended to reduce, by quite a lot, but there also being a lot of deaths due to other causes that aren’t affected by the intervention. To get my usual point in, if Bayesian methods were used, you would find a substantially greater probability of benefit for the intervention for cardiovascular death and all-cause mortality.
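To give a feel for what "probability of benefit" might look like here, the sketch below turns each reported hazard ratio and 95% CI into an approximate posterior probability that the hazard ratio is below 1. It assumes a flat prior and a normal distribution on the log hazard ratio scale, so it is a crude approximation from the published summaries and not a proper Bayesian re-analysis of the trial.

```python
from math import log
from statistics import NormalDist

def prob_benefit(hr, lower, upper):
    """Approximate P(HR < 1) from a hazard ratio and its 95% CI,
    assuming a flat prior and normality on the log scale."""
    z = NormalDist().inv_cdf(0.975)
    se = (log(upper) - log(lower)) / (2 * z)
    return NormalDist(mu=log(hr), sigma=se).cdf(0.0)

outcomes = {
    "all-cause death": (0.87, 0.68, 1.12),
    "cardiovascular death": (0.77, 0.57, 1.05),
    "sudden cardiac death": (0.50, 0.31, 0.82),
}

for name, (hr, lower, upper) in outcomes.items():
    print(f"{name}: P(HR < 1) is about {prob_benefit(hr, lower, upper):.2f}")
```

On these assumptions even the "non-significant" all-cause result corresponds to a probability of benefit well above one half.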

All-cause death was chosen as the primary outcome, and following convention, the conclusions are based on this. But the conclusion is sensitive to the choice of primary outcome: if sudden cardiac death had been the primary outcome, the trial would have been regarded as “positive”.

So, finally, to get around to the general issues. It is the convention in trials to nominate a single “primary outcome”, which is used for calculating a target sample size and for drawing the main conclusions of the study. Usually this comes down to saying there was benefit (“positive trial”) if the result gets a p-value of less than 0.05, and not if the p-value exceeds 0.05 (“negative trial”). The expectation is that a single primary outcome will be nominated (sometimes you can get away with two), but that means that the conclusions of the trial will be sensitive to this choice. I think the reason for having a single primary outcome stems from concerns over type I errors if lots of outcomes are analysed. You could then claim a “positive” trial and treatment effectiveness if any of them turned out “significant” – though obviously restricting yourself to a single primary outcome is a pretty blunt instrument for addressing multiple analysis issues.
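The type I error concern is easy to quantify in the simplest case. The sketch below assumes the outcomes are independent and that there is no true effect on any of them; real trial outcomes are usually correlated, so treat these numbers as an upper-end illustration.

```python
def prob_any_significant(n_outcomes, alpha=0.05):
    """Chance of at least one p < alpha among independent outcomes
    when there is no true effect on any of them."""
    return 1 - (1 - alpha) ** n_outcomes

for k in (1, 2, 5, 10):
    print(f"{k:>2} outcomes: {prob_any_significant(k):.2f}")
```

With ten independent outcomes the chance of at least one spurious "significant" result is about 40%.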

There are lots of situations where it isn’t clear that a single outcome is sufficient for drawing conclusions from a trial, as in DANISH: the intervention should help by reducing sudden cardiac death, but that won’t be any help if it increases deaths for other reasons – so both sudden cardiac deaths and overall deaths are important. Good interpretation isn’t helped by the conventions (=bad habits) of equating “statistical significance” with clinical importance, and labelling the treatment as effective or not based on a single primary outcome.

Reference for DANISH trial: N Engl J Med 2016; 375:1221-1230, September 29, 2016

http://www.nejm.org/doi/full/10.1056/NEJMoa1608029

## April 14, 2016

### NEJM letter and cardiac arrest trial

I recently had a letter in the New England Journal of Medicine, about a trial they had published that compared continuous versus interrupted chest compressions during resuscitation after cardiac arrest. Interrupted compressions are standard care - the interruptions are for ventilations to oxygenate the blood, prior to resuming chest compressions to keep it circulating. The issue was that the result of the trial was 0.7% better survival in the interrupted-compression group, with 95% CI from -1.5% to 0.1%. So the data are suggesting a probable benefit to interrupted compressions. However, on Twitter the NEJM announced this as “no difference”, no doubt because the difference was not “statistically significant”. So I wrote pointing out that this wasn’t a good interpretation, and the dichotomy into “significant” and “non-significant” is pretty unhelpful in situations where the results are close to “significance”. Bayesian methods have a huge advantage here, in that they can actually quantify the probability of benefit. An 80% probability that the treatment is beneficial is a lot more useful than “non-significance”, and might lead to very different actions.

The letter was published along with a very brief reply from the authors (they were probably constrained, as I was in the original letter, by a tiny word limit): *“Bayesian analyses of trials are said to offer some advantages over traditional frequentist analyses. A limitation of the former is that different people have different prior beliefs about the effect of treatment. Alternative interpretations of our results offered by others show that there was not widespread clinical consensus on these prior beliefs. We are not responsible for how the trial results were interpreted on Twitter.”*

Taking the last point first: no, the authors did not write the Twitter post. But they also did not object to it. I'm not accusing them of making the error that non-significance = no difference, but it is so common that it usually - as here - passes without comment. But it's just wrong.

Their initial point about priors illustrates a common view, that Bayesian analysis is about incorporating individual prior beliefs into the analysis. While you can do this, it is neither necessary nor a primary aim. As Andrew Gelman has said (and I have repeated before): prior information, not prior beliefs. We want to base a prior on the information that we have at the start of the trial, and if that is no information, then that’s fine. However, we almost always do have some information on what the treatment effect might plausibly be. For example, it’s very unusual to find an odds ratio of 10 in any trial, so an appropriate prior would make effects of this (implausible) size unlikely. More importantly, in this case, getting too hung up on priors is a bit irrelevant, because the trial was so huge (over 20,000 participants) that the data will completely swamp any reasonable prior.

It isn’t possible to re-create the analysis from the information in the paper, as it was a cluster-randomised trial with crossover, which needs to be taken into account. Just using the outcome data for survival to discharge in a quick and dirty Bayesian analysis, though, gives a 95% credible interval of something like from 0.84 to 1.00, with a probability of the odds ratio being less than 1 of about 98%. That probably isn’t too far away from the correct result, and suggests pretty strongly that survival may be a bit worse in the continuous compression group. “No difference” just doesn’t seem like an adequate summary to me.
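An even cruder check is possible from the published difference alone. This sketch takes the reported survival difference (0.7 percentage points lower with continuous compressions, 95% CI -1.5 to 0.1) and assumes a flat prior with a normal posterior. The sign convention (continuous minus interrupted) is an assumption, and because it works from a rounded interval and not the outcome data, it should not be expected to match the figure above exactly.

```python
from statistics import NormalDist

def prob_difference_below_zero(diff, lower, upper):
    """Approximate P(true difference < 0) from an estimate and its 95% CI,
    assuming a flat prior and a normal posterior."""
    z = NormalDist().inv_cdf(0.975)
    se = (upper - lower) / (2 * z)
    return NormalDist(mu=diff, sigma=se).cdf(0.0)

# Difference in survival, continuous minus interrupted, in percentage points
p = prob_difference_below_zero(diff=-0.7, lower=-1.5, upper=0.1)
print(f"P(survival worse with continuous compressions) is about {p:.2f}")
```

Either way the probability is well above 90%, which is hard to square with a headline of "no difference".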

My letter and the authors’ reply are here: http://www.nejm.org/doi/full/10.1056/NEJMc1600144

The original trial report is here: Nichol G, Leroux B, Wang H, et al. Trial of continuous or interrupted chest compressions during CPR. N Engl J Med 2015;373:2203-2214 http://www.nejm.org/doi/full/10.1056/NEJMoa1509139

## December 09, 2015

### Why do they say that?

A thing I've heard several times is that Bayesian methods might be advantageous for Phase 2 trials but not for Phase 3. I've struggled to understand why people would think that. To me, the advantage of Bayesian methods comes in the fact that the methods make sense, answer relevant questions and give understandable answers, which seem just as important in Phase 3 trials as in Phase 2.

One of my colleagues gave me his explanation, which I will paraphrase. He made two points:

*1. Decision-making processes are different after Phase 2 and Phase 3 trials; following Phase 2, decisions about whether to proceed further are made by researchers or research funders, but after Phase 3 decisions (about use of therapies presumably) are taken by "society" in the form of regulators or healthcare providers. This makes the Bayesian approach harder as it is harder to formulate a sensible prior (for Phase 3 I think he means).*

*2. In Phase 3 trials sample sizes are larger so the prior is almost always swamped by the data, so Bayesian methods don't add anything.*

My answer to point 1: Bayesian methods are about more than priors. I think this criticism comes from the (limited and in my view somewhat misguided) view of priors as a personal belief. That is one way of specifying them but not the most useful way. As Andrew Gelman has said, prior INFORMATION not prior BELIEF. And you can probably specify information in pretty much the same way for both Phase 2 and Phase 3 trials.

My answer to point 2: Bayesian methods aren't just about including prior information in the analysis (though they are great for doing that if you want to). I'll reiterate my reasons for preferring them that I gave earlier - the methods make sense, answer relevant questions and give understandable answers. Why would you want to use a method that doesn't answer the question and nobody understands? Also, if you DO have good prior information, you can reach an answer more quickly by incorporating that in the analysis - which we kind of do by doing trials and then combining them with others in meta-analyses; but doing it the Bayesian way would be neater and more efficient.

## June 17, 2013

### Sample size and the Minimum Clinically Important Difference

Performing a sample size calculation has become part of the rigmarole of randomized trials and is now expected as a sign of “quality”. For example, the CONSORT guidelines include reporting of a sample size calculation as one of the items that should be included in a trial report, and many quality scales and checklists include presence of a sample size calculation as one of the quality markers. Whether any of this is right or just folklore is an interesting issue that receives little attention. [I’m intending to come back to this issue in future posts]

For now I want to focus on one aspect of sample size calculations that seems to me not to make much sense.

In the usual idealized sample size calculation, a treatment effect that it is desired to detect is assumed. Ideally this should be the “minimum clinically important difference” (MCID); the smallest difference that it would be worthwhile to know about, or the smallest difference that would lead to one treatment being favoured over the other in clinical practice. Obviously this is not an easy thing to calculate, but leaving practical issues to one side for the moment, in an ideal situation you would have a good idea of the MCID. Having established the MCID, this is used as the treatment effect in a standard sample size calculation, based on a significance test (almost always at the 5% significance level) and a specified level of power (almost invariably 80% or 90%). This gives a number of patients that need to be recruited. This number will give a “statistically significant” difference the specified percentage of the time (power) if the true difference is the MCID.
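For a binary outcome, one common textbook version of this calculation (a normal approximation with unpooled variances, given here as an illustration and not the only formula in use) is:

$$
n_{\text{per group}} = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{(p_1 - p_2)^2}
$$

where $p_1$ and $p_2$ are the assumed outcome proportions in the two groups (so their difference is the MCID), $\alpha$ is the significance level and $1-\beta$ is the power. Nothing in the formula asks that the confidence interval should exclude effects smaller than the MCID; it only targets the interval excluding zero difference.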

The problem here is that the sample size calculation is based on finding a statistically significant result, not demonstrating that the difference is larger than a certain size. But if you have identified a minimum clinically important difference, what you want to be able to say with a high degree of confidence is whether the treatment effect exceeds it. However, the standard sample size calculation is based on statistical significance, which is equivalent to finding that the difference is non-zero. If the true difference is the MCID, the confidence limit nearer to zero is likely to be close to zero, and the interval will only rarely be far enough from zero to exclude effects smaller than the MCID. Hence the standard sample size may have adequate power to show whether there is a non-zero difference, but has very little power to show that the difference exceeds the MCID. Hence most results will be inconclusive; they will show that there is evidence of benefit, but leave uncertainty about whether it is large enough to be clinically important.

As an example, imagine the MCID is thought to be a risk ratio of 0.75 (a bad outcome occurs in 40% of the control group and 30% of the intervention group). A standard sample size calculation gives 350 participants per group. So you do the trial and (unusually!) the proportions are exactly as expected: 40% in the control and 30% in the intervention group. The calculated risk ratio is 0.75 but the 95% confidence interval around this is 0.61 to 0.92. So you can conclude that the treatment has a non-zero effect but you don’t know whether it exceeds the minimum clinically important difference. With this result you would only have a 50% chance that the real treatment effect exceeded the MCID.
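The numbers in this example can be reproduced with a few lines. The sketch below assumes 80% power and a two-sided 5% significance level for the sample size (neither is stated above), uses the usual large-sample confidence interval for a risk ratio, and reads the "50% chance" as a flat-prior, normal-approximation probability that the true risk ratio is below 0.75.

```python
from math import ceil, exp, log, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)

def n_per_group(p1, p2, power=0.8, alpha=0.05):
    """Normal-approximation sample size per group for two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

def risk_ratio_ci(events_t, n_t, events_c, n_c):
    """Risk ratio, standard error of its log, and 95% confidence interval."""
    rr = (events_t / n_t) / (events_c / n_c)
    se = sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    return rr, se, exp(log(rr) - Z95 * se), exp(log(rr) + Z95 * se)

print("n per group (80% power):", n_per_group(0.4, 0.3))

# 350 per group, 30% (105) events with intervention and 40% (140) with control
rr, se, lower, upper = risk_ratio_ci(105, 350, 140, 350)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")

# Flat-prior probability that the true risk ratio is better than the MCID
mcid = 0.75
p_exceeds = NormalDist(mu=log(rr), sigma=se).cdf(log(mcid))
print(f"P(RR < {mcid}) = {p_exceeds:.2f}")
```

Because the observed risk ratio sits exactly on the MCID, half of the probability lies on either side of it, however large the trial.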

So sizing a trial based on the MCID might seem like a good idea, but in fact if you use the conventional methods, the result is probably not going to give you much information about whether the treatment effect really is bigger than the MCID or not. I suspect that in most cases the excitement of a “statistically significant” result overrides any considerations of the strength of the evidence that the effect size is clinically useful.