August 19, 2017

Trial results infographics

There is a fashion for producing eye-catching infographics of trial results. This is a good thing in some ways, because it’s important to get the results communicated to doctors and patients in a way they can understand. Here’s one from the recent WOMAN trial (evaluating tranexamic acid for postpartum haemorrhage).

[Image: WOMAN trial infographic]

What’s wrong with this? To my mind the main problem is that if you reduce the messages to a few headlines then you end up leaving out a lot of pretty important information. One obvious thing missing from these results is uncertainty. We don’t know, based on the trial’s results, that the number of women bleeding to death would be reduced by 30% – that’s just the point estimate, and there’s substantial uncertainty about this.

Actually the reduction by 30% isn’t the trial’s main result, which has the risk ratio for death due to haemorrhage as 0.81, 95% CI 0.65–1.00. So that’s actually a point estimate reduction of 19%, with a range of effects “consistent with the data” (or not significantly different from the data) of a reduction between 35% and zero. The 30% reduction seems to come from a subgroup analysis of women treated within 3 hours of delivery. A bit naughty to use a subgroup analysis as your headline result, but this highlights another problem with the infographic – you don’t really know what you’re looking at. In this case they have chosen to present a result that the investigators presumably feel represents the real treatment effect – but others might have different views, and there isn’t any way of knowing that you’re looking at results that have been selected to support a particular story.
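As a quick check of the arithmetic above, here is a minimal sketch that converts a risk ratio and its confidence limits into percentage reductions, using (1 − RR) × 100. The function name is my own and is only for illustration.

```python
def percent_reduction(risk_ratio):
    """Convert a risk ratio into a percentage reduction in risk."""
    return (1 - risk_ratio) * 100


# Figures quoted above for death due to haemorrhage in the WOMAN trial.
estimates = [
    ("point estimate", 0.81),
    ("lower CI limit", 0.65),
    ("upper CI limit", 1.00),
]

for label, rr in estimates:
    print(f"{label}: RR {rr:.2f} -> {percent_reduction(rr):.0f}% reduction")
```

The headline 30% does not appear anywhere in this range of summaries for the main result, which is the point being made above.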

[I’m guessing that the justification for presenting the “<3 hour” subgroup is that there wasn’t a clear effect in the “>3 hour” subgroup (RR 1.07, 95% CI 0.76, 1.51), so the belief is that treatment needs to be given within 3 hours to be effective. There could well be an effect of time from delivery, but it needs a better analysis than this.]

WOMAN trial: Lancet, Volume 389, No. 10084, p2105–2116, 27 May 2017

PS And what’s with the claim at the top that the drug could save 1/3 of the women who would otherwise die from bleeding after childbirth? That’s not the same as 30%, which wasn’t the trial’s result anyway. I guess a reduction of 1/3 is a possible result but so are reductions of 25% or 10%.

July 18, 2017

The future is still in the future

I just did a project with a work experience student that involved looking back through four top medical journals for the past year (NEJM, JAMA, Lancet and BMJ), looking for reports of randomised trials. As you can imagine, there were quite a lot - I'm not sure exactly how many because only a subset were eligible for the study we were doing. We found 89 eligible for our study, so there were probably at least 200 in total.

Of all those trials, I saw only ONE that used Bayesian statistical methods. The rest were still doing all the old stuff with null hypotheses and significance testing.

May 30, 2017

Hospital-free survival

One of the consequences of the perceived need for a “primary outcome” is that people try to create a single outcome variable that will include all or most of the important effects, and will increase the incidence of the outcome, or in some other way allow the sample size calculation to give you a smaller target. There has for some time been a movement to use “ventilator-free days” in critical care trials, but a recent trend is for trials of treatments for cardiac arrest to use “hospital-free survival” or “ICU-free survival,” defined as the number of days that a trial participant was alive and not in hospital or ICU, up to 30 days post randomisation.

A recent example is Nichol et al (2015), who compared continuous versus interrupted chest compressions during CPR. It was a massive trial, randomising over 23,000 participants, and found 9% survival with continuous compressions and 9.7% with interrupted. Inevitably this was described as “continuous chest compressions during CPR performed by EMS providers did not result in significantly higher rates of survival.” But it did result in a “significant” difference in hospital-free survival, which was a massive 0.2 days shorter in the continuous compression group (95% CI -0.3, -0.1, p=0.004).

A few comments. First, with a trial of this size and a continuous outcome, it’s almost impossible not to get statistical significance, even if the effect is tiny, as you can see here. I very much doubt that anyone would consider a difference in hospital-free survival of 0.2 days (that’s about 4 hours 48 minutes) of any consequence, but it’s there in the abstract as a finding of the trial.

Second, it’s a composite outcome, and like almost all composite outcomes, it includes things that are very different in their importance; in this case, whether the patient is alive or dead, and whether they are in hospital. It’s pretty hard to interpret this. What does a difference of about 5 hours in the time alive and out of hospital mean? Would a patient think that was a good reason to use the intervention? I doubt it. They would surely be more interested in the chances of surviving, and maybe secondarily whether the amount of time they might spend in hospital would be different.

Third, and this is especially true for cardiac arrest trials, the mean is a terrible way to summarise these data. The survival rate in this trial was about 9%. The vast majority of deaths would have occurred either before reaching hospital or in hospital, so all of those patients would have hospital-free survival of zero. The 9% or so of patients that survived to hospital discharge would have a number of hospital-free days between 0 and 30. So the means for each group will be pulled strongly towards zero by the huge number of participants with zero hospital-free days. The means for each group are presented in Table 3 of the paper, as 1.3 ± 5.0, and 1.5 ± 5.3, without comment, even though that seems to imply negative hospital-free survival. Definitely a case here for plotting the data to see what is going on; the tabulated summary is inadequate. The difference is almost certainly driven by the 0.7% higher survival in the interrupted compression group, which was possibly an important finding. However, because it was non-significant it is pretty much ignored and assumed to be zero.
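To see why the mean and SD are such a poor summary here, this is a minimal sketch using a made-up cohort (not the trial data). I have assumed 10,000 patients, of whom 9,070 never leave hospital alive and 930 survivors are spread evenly over 0 to 30 hospital-free days; the real distribution among survivors is unknown to me and is probably not uniform.

```python
from statistics import mean, median, pstdev

# Hypothetical cohort, NOT the trial data:
# 9,070 patients with zero hospital-free days, plus 930 survivors
# spread evenly over 0..30 hospital-free days (30 patients per value).
non_survivors = [0] * 9070
survivors = [days for days in range(31) for _ in range(30)]
cohort = non_survivors + survivors

avg = mean(cohort)
sd = pstdev(cohort)
share_zero = cohort.count(0) / len(cohort)

print(f"mean = {avg:.2f} days, SD = {sd:.2f} days")
print(f"median = {median(cohort)} days")
print(f"mean - SD = {avg - sd:.2f} days (an impossible value)")
print(f"share of patients with zero days = {share_zero:.1%}")
```

Even this crude cohort gives a mean of about 1.4 days with an SD of about 5.1, much like the figures in Table 3, while the median is zero. The summary statistics say almost nothing about what happened to the people who survived.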

Nichol G et al. Trial of Continuous or Interrupted Chest Compressions during CPR. NEJM 2015; 373: 2203-2214.

April 07, 2017

[Image: slide from Doug Altman’s talk]

Here’s a photo of a slide from a recent talk by Doug Altman about hypothesis tests and p-values (I nicked the picture from Twitter, additions by me). I wasn’t there so I don’t know exactly what Doug said, but I totally agree that hypothesis testing and p-values are a massive problem.

Nearly five years ago (July 2012 if I remember correctly) I stood up in front of the Warwick Medical School Division of Health Sciences, in a discussion about adopting a “Grand Challenge” for the Division, and proposed that our “Grand Challenge” should be to abandon significance testing. The overwhelming reaction was blank incomprehension. There was a vote for which of the four or five proposals to adopt, and ONE PERSON voted for my idea (for context, as a rough guess, there were probably 200-300 people in the Division at that time).

It was certainly well-known before 2012 that hypothesis tests and p-values were a real problem, but that didn’t seem to have filtered through to many medical researchers.

March 05, 2017

Sample size statement translation

Here are a couple of statements about the justification of the sample size from reports of clinical trials in high-impact journals (I think one is from JAMA and the other from NEJM):

We estimated that a sample size of 3000 … would provide 90% power to detect an absolute difference of 6.3 percentage points in the rate of [outcome] between the [intervention] group and the placebo group.

The study was planned to detect a difference of 1.1 points in the [outcome score] between the 2 interventions with a significance level of .05 and a power level of 90%.

There is nothing remarkable about these at all; they were just the first two that I came across in rummaging through my files. Statements like this are almost always found in clinical trial reports.

A translation, of the first one:

“We estimated that if we recruited 3000 participants and the true absolute difference between intervention and placebo is 6.3 percentage points, then if we assumed that there was no difference between the groups, the probability (under this assumption of no difference) of getting data that were as unusual or more unusual than those we actually obtained would be less than 0.05 in 90% of a long series of replications of the trial.”

That’s what it actually means but I guess most clinicians and researchers would find that pretty impenetrable. An awful lot is hidden by the simple word “detect” in the sample size justification statements. I suspect the language (“detect a difference”) feeds into the misunderstandings of “significant” results – it’s a real difference, not due to chance, etc.
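The “long series of replications” in the translation can be imitated by simulation. This is only a sketch under assumptions of my own: the quoted statement doesn’t give the event rate in the placebo group, so I have assumed 30%, with 1500 participants per group, a reduction of 6.3 percentage points, and a simple two-sided z-test for two proportions. Because of these assumptions the simulated proportion need not come out at 90%.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p(events_a, events_b, n_per_group):
    """Two-sided p-value from a pooled z-test comparing two proportions."""
    pooled = (events_a + events_b) / (2 * n_per_group)
    se = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    if se == 0:
        return 1.0
    z = (events_a - events_b) / n_per_group / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_power(n_per_group, rate_control, rate_treated, replications, seed=1):
    """Proportion of simulated trials with p < 0.05, given the true rates."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(replications):
        # Simulate one trial: count events in each group.
        events_c = sum(rng.random() < rate_control for _ in range(n_per_group))
        events_t = sum(rng.random() < rate_treated for _ in range(n_per_group))
        if two_sided_p(events_c, events_t, n_per_group) < 0.05:
            significant += 1
    return significant / replications


# Assumed values: 30% placebo event rate, 6.3 percentage points lower on treatment.
power = simulate_power(1500, 0.30, 0.30 - 0.063, replications=400)
print(f"Proportion of replications with p < 0.05: {power:.2f}")
```

Note what the simulation has to do to produce “power”: it fixes a true difference, then repeatedly tests a hypothesis of no difference that it knows to be false. That is all that “detect” means.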

February 11, 2017

Andrew Gelman agrees with me!

Follow-up to The Fragility Index for clinical trials from Evidence-based everything

I’ve slipped in my plan to do a new blog post every week, but here’s a quick interim one.

I blogged about the fragility index a few months back. Andrew Gelman has also blogged about this, and thought much the same as I did (OK, I did ask him what he thought).

See here:

December 14, 2016

Bayesian methods and trials in rare and common diseases

One of the places that Bayesian methods have made some progress in the clinical trials world is in very rare diseases. And it’s true, traditional methods are hopeless in this situation, where you can never get enough recruits to get anywhere near the sample size that traditional methods demand for an “adequately powered” study, and it’s unlikely that a result will be “statistically significant”. Bayesian methods really help here, because they give you a result in terms of probability that a treatment is superior. This is good for two main reasons. First, it’s helpful to quantify the probability of benefit, and its size and uncertainty. This tells us a lot more than simply dichotomising it into “significant” and “non-significant”, with the unstated assumption that “significant” means clinically useful. Second, there isn’t a fixed probability of benefit that means an intervention should be used; it will vary from situation to situation. For example, if there is almost no cost to using a treatment, it might only need a small probability of being better to be worthwhile. If we don’t estimate this probability we can’t make this sort of judgement.

But (and this is something I have experienced several times now in a variety of places so I think it is real) – this seems to have had an unfortunate side effect. A perception seems to have grown that Bayesian methods are something to consider using when a “proper” trial (with all of the usual stuff: interpretation based on p < 0.05 in a null hypothesis test, fixed pre-planned sample size based on a significance test, 80% or 90% power and so on) isn’t feasible. In reality, the ability to quantify probability of benefit would be helpful in just about all situations, even (or especially) large Phase 3 trials that are looking for modest treatment benefits. How many of these trials don’t “achieve statistical significance” but have results that would show a 70% or 80% probability of benefit? They might still provide good enough evidence to make decisions about treatments (based on, for example, cost-effectiveness), but at the moment they tend to get labelled as “non-significant” or “negative trials.”

November 12, 2016

“The probability that the results are due to chance”

One of the (wrong) explanations that you often see of what a p-value means is “the probability that data have arisen by chance.” I think people may struggle to see why this is wrong, as I did for a long time. A p-value is the probability of getting the data (or more extreme data) if the null hypothesis (no difference) is correct – right? So that would mean the specific result you got must have been due to chance variation, doesn’t it? So why isn’t the p-value the probability that the result was due to chance?

The problem is that there are two ways of interpreting “the probability that a result is due to chance.”
1. The probability that chance or random variation was the process that produced the result;
2. The probability of getting the specific data (or more extreme data) that you got in your experiment, if chance was the only process operating.

The second of these is what the p-value tells you; but the first is the interpretation that most people give it. The p-value tells you nothing about the process that produced the result, because it is calculated on the assumption that the null hypothesis is correct.
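A small sketch of why the two interpretations differ, using Bayes’ rule with made-up numbers. To say anything about the first interpretation you need to know how often the null hypothesis is true to begin with, and the p-value doesn’t contain that information. The prior proportions, the 0.05 threshold and the 80% power below are all assumptions for illustration, and for simplicity the calculation conditions on “p < 0.05” rather than on an exact p-value.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.80):
    """P(null hypothesis true | p < alpha), by Bayes' rule.

    prior_null: assumed proportion of tested hypotheses where the null is true.
    alpha:      probability of p < alpha when the null is true.
    power:      probability of p < alpha when there is a real effect.
    """
    false_positives = prior_null * alpha
    true_positives = (1 - prior_null) * power
    return false_positives / (false_positives + true_positives)


for prior_null in (0.5, 0.9, 0.99):
    posterior = prob_null_given_significant(prior_null)
    print(f"If {prior_null:.0%} of nulls are true: "
          f"P(null | p < 0.05) = {posterior:.2f}")
```

The same “significant” result corresponds to very different probabilities that chance was the process at work, depending entirely on the assumed prior.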

November 03, 2016

Statistical significance and decision-making

One of the defences of the use of traditional “null hypothesis significance testing” (NHST) in clinical trials is that, at some point, it is necessary to make a decision about whether a treatment should be used, and “statistical significance” gives us a way of doing that. I hear versions of this argument on a regular basis.

But the argument has always seemed to me to be ridiculous. Even if significance tests could tell you that the null hypothesis was wrong (they can’t), that doesn’t give you any basis for a sensible decision. A null hypothesis being wrong doesn’t tell you whether the treatment has a big enough effect to be worth implementing, and it takes no account of other important things, like cost-effectiveness, safety, feasibility or patient acceptability. Not a good basis for what are potentially life and death decisions.

But don’t listen to me: listen to The American Statistical Association. Their Statement on Statistical Significance and P-Values from earlier this year addresses exactly this point. The third of their principles is:

“Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.”

Pretty unambiguous, I think.

October 12, 2016

“Something is rotten in the state of Denmark”

The DANISH trial (in which, pleasingly, the D stands for “Danish”, and it was conducted in Denmark too), evaluated the use of Implantable Cardioverter Defibrillators (ICD) in patients with heart failure that was not due to ischaemic heart disease. The idea of the intervention is that it can automatically restart the heart in the event of a sudden cardiac arrest – so it might help these patients, who are at increased risk of their heart stopping suddenly (obviously there is a lot more clinical detail to this).

The trial recruited 1116 patients and found that the primary outcome (death from any cause) occurred in 120/556 (21.6%) in the ICD group and 131/560 (23.4%) in control; a hazard ratio of 0.87, 95% CI 0.68, 1.12. The conclusion was (from the abstract):

“prophylactic ICD implantation … was not associated with a significantly lower long-term rate of death from any cause than was usual clinical care”;

and from the end of the paper:

“prophylactic ICD implantation … was not found to reduce longterm mortality.”

Note, in passing, the subtle change from “no significant difference” in the abstract, which at least has a chance of being interpreted as a statement about statistics, to “not found to reduce mortality” – a statement about the clinical effects. Of course the result doesn’t mean that, but the error is so common as to be completely invisible.

Reporting of the trial mostly put it across as showing no survival improvement, for example:

The main issue in this trial, however, was that the ICD intervention DID reduce sudden cardiac death, which is what the intervention is supposed to do: 24/556 (4.3%) in the ICD group and 46/560 (8.2%) in control, hazard ratio 0.50 (0.31, 0.82). All cardiovascular deaths (sudden and non-sudden) were also reduced in the ICD group, but not by so much: HR 0.77 (0.57, 1.05). You might expect a result like this if the ICD reduced sudden cardiac deaths while both groups had a similar risk of non-sudden cardiac death. When all deaths are counted (including cardiac and other causes), the difference in the outcome that the intervention can affect starts getting swamped by outcomes that it doesn’t reduce. The sudden cardiac deaths make up a small proportion of the total, so the overall difference between the groups is dominated by deaths that weren’t likely to differ between the groups, and the difference in all-cause mortality is much smaller (and “non-significant”). So all of the results seem consistent with the intervention reducing the thing it is intended to reduce, by quite a lot, but there also being a lot of deaths due to other causes that aren’t affected by the intervention. To get my usual point in, if Bayesian methods were used, you would find a substantial probability of benefit for the intervention for cardiovascular death and all-cause mortality, despite both results being “non-significant”.
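For a rough idea of the size of those probabilities, here is a back-of-envelope sketch. It is not a re-analysis of the trial: I have assumed a normal approximation on the log hazard ratio scale, reconstructed the standard error from the reported 95% CI, and used a flat prior, so that the posterior for the log hazard ratio is simply Normal(log HR, SE). A real Bayesian analysis would need the trial data and a considered prior.

```python
from math import log
from statistics import NormalDist


def prob_hr_below_one(hr, ci_low, ci_high):
    """Approximate posterior probability that the hazard ratio is below 1.

    Assumes a flat prior and a normal posterior on the log scale, with the
    standard error recovered from the width of the reported 95% CI.
    """
    se = (log(ci_high) - log(ci_low)) / (2 * 1.96)
    posterior = NormalDist(mu=log(hr), sigma=se)
    return posterior.cdf(0.0)  # log(HR) < 0 means HR < 1, i.e. benefit


# Hazard ratios and 95% CIs as quoted above from the DANISH trial.
results = {
    "all-cause death": (0.87, 0.68, 1.12),
    "cardiovascular death": (0.77, 0.57, 1.05),
    "sudden cardiac death": (0.50, 0.31, 0.82),
}

for outcome, (hr, low, high) in results.items():
    print(f"{outcome}: P(HR < 1) is about {prob_hr_below_one(hr, low, high):.2f}")
```

Under these assumptions the probability of some benefit comes out at roughly 86% for all-cause death and roughly 95% for cardiovascular death, which is a rather different message from “not found to reduce mortality”.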

All-cause death was chosen as the primary outcome, and following convention, the conclusions are based on this. But the conclusion is sensitive to the choice of primary outcome: if sudden cardiac death had been the primary outcome, the trial would have been regarded as “positive”.

So, finally, to get around to the general issues. It is the convention in trials to nominate a single “primary outcome”, which is used for calculating a target sample size and for drawing the main conclusions of the study. Usually this comes down to saying there was benefit (“positive trial”) if the result gets a p-value of less than 0.05, and not if the p-value exceeds 0.05 (“negative trial”). The expectation is that a single primary outcome will be nominated (sometimes you can get away with two), but that means that the conclusions of the trial will be sensitive to this choice. I think the reason for having a single primary outcome stems from concerns over type I errors if lots of outcomes are analysed. You could then claim a “positive” trial and treatment effectiveness if any of them turned out “significant” – though obviously restricting yourself to a single primary outcome is a pretty blunt instrument for addressing multiple analysis issues.
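The type I error concern can be put in numbers with a short sketch. It assumes the outcomes are independent and that every null hypothesis is true, which is my simplification: outcomes in a real trial are usually correlated, so the true inflation would be smaller than this.

```python
def prob_any_significant(n_outcomes, alpha=0.05):
    """Chance that at least one of n independent outcomes gives p < alpha
    when the treatment has no effect on any of them."""
    return 1 - (1 - alpha) ** n_outcomes


for n_outcomes in (1, 2, 5, 10):
    chance = prob_any_significant(n_outcomes)
    print(f"{n_outcomes:2d} outcomes: {chance:.1%} chance of at least one 'significant' result")
```

With ten independent outcomes the chance of at least one “significant” result is about 40%, which is the worry behind the single primary outcome convention.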

There are lots of situations where it isn’t clear that a single outcome is sufficient for drawing conclusions from a trial, as in DANISH: the intervention should help by reducing sudden cardiac death, but that won’t be any help if it increases deaths for other reasons – so both sudden cardiac deaths and overall deaths are important. Good interpretation isn’t helped by the conventions (=bad habits) of equating “statistical significance” with clinical importance, and labelling the treatment as effective or not based on a single primary outcome.

Reference for DANISH trial: N Engl J Med 2016; 375:1221-1230, September 29, 2016
