June 24, 2016

The Fragility Index for clinical trials

Disclaimer: The tone of this post may have been affected by the results of the British EU referendum.

There has been considerable chat and Twittering about the “fragility index” so I thought I’d take a look. The basic idea is this: researchers get excited about “statistically significant” (p<0.05) results, the standard belief being that if you’ve found “significance” then you have found a real effect. [this is of course wrong, for lots of reasons] But some “significant” results are more reliable than others. For example, if you have a small number of events in your trial, it would only require a few patients to have had different outcomes to tip a “significant” result into “non-significance”. So it would be useful to have a measure of the robustness of statistically significant results, so that readers will get a sense of how reliable they are. The Fragility Index (FI) aims to provide this. It is calculated as the number of patients that would have had to have had different outcomes in order to render the result “non-significant” (p > 0.05). So if a trial had 5/100 with the main outcome in one group and 18/100 in the other, the p-value would be 0.007 (pretty significant, huh?). The fragility index would be 3 (according to the handy online calculator www.fragilityindex.com, which will calculate your p-value to 15 decimal places): only three of the intervention group non-events would need to have been events to raise the p-value above 0.05.
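
The procedure is easy to sketch. Here's a minimal Python version, assuming the usual recipe from the Walsh et al. paper: switch non-events to events in the group with fewer events, one patient at a time, recomputing a two-sided Fisher's exact test until p ≥ 0.05. (Exact p-values from Fisher's test won't necessarily match the online calculator's, but the index itself comes out the same for this example.)

```python
# Sketch of the Fragility Index calculation, using the worked example in
# the text: 5/100 events in one group vs 18/100 in the other. Assumes the
# standard FI recipe: convert non-events to events in the group with fewer
# events, one patient at a time, until a two-sided Fisher's exact test
# gives p >= 0.05.
from scipy.stats import fisher_exact

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of patients whose outcomes must change to make p >= alpha."""
    # Work on the group with the fewer events, as in the original proposal
    if events_a > events_b:
        events_a, n_a, events_b, n_b = events_b, n_b, events_a, n_a
    _, p = fisher_exact([[events_a, n_a - events_a],
                         [events_b, n_b - events_b]])
    if p >= alpha:
        return 0  # not "significant" to start with
    flips = 0
    while p < alpha and events_a < n_a:
        events_a += 1  # switch one non-event to an event
        flips += 1
        _, p = fisher_exact([[events_a, n_a - events_a],
                             [events_b, n_b - events_b]])
    return flips

print(fragility_index(5, 100, 18, 100))  # the worked example above: FI = 3
```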

There’s a paper introducing this idea, from 2014:
Walsh M et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. J Clin Epidemiol. 2014 Jun;67(6):622-8. doi: 10.1016/j.jclinepi.2013.10.019. Epub 2014 Feb 5.

I think there are good and bad aspects to this. On the positive side, it’s good that people are thinking about the reliability of “significant” results and acknowledging that just achieving significance doesn’t mean that you’ve found anything important. But to me the Fragility Index doesn’t get you much further forward. If you find a low Fragility Index, what do you do with that information? We have always known that significance when there are few events is unreliable. The problem is really judging that there is a qualitative difference between results that are “significant” and “non-significant”, a zombie myth that the Fragility Index doesn’t do anything to dispel. The justification is that judging results by “significance” is an ingrained habit that isn’t going to go away in a hurry, so the FI will highlight unreliable results and help people to avoid mistakes in interpretation. I have some sympathy with that view, but really, the problem is with the use of significance testing, and we should be promoting things that will help us to move away from this, rather than introducing new procedures that seem to validate it.

There are some things in the paper that I really didn’t like, for example: “The concept of a threshold P-value to determine statistical significance aids our interpretation of trial results.” Really? How exactly does it do that? It just creates an artificial dichotomy based on a nonsensical criterion. The paper tries to explain in the next sentence: “It allows us to distill the complexities of probability theory into a threshold value that informs whether a true difference likely exists”. I have no idea what the first part of that means, but the second part is just dead wrong. No p-value will ever tell you “whether a true difference likely exists” because they are calculated on the assumption that the difference is zero. This is just perpetuating one of the common and disastrous misinterpretations of p-values, and it is pretty surprising that this set of authors gets it wrong. Or maybe it isn’t, considering that almost everyone else does.

April 14, 2016

NEJM letter and cardiac arrest trial

I recently had a letter in the New England Journal of Medicine, about a trial they had published that compared continuous versus interrupted chest compressions during resuscitation after cardiac arrest. Interrupted compressions are standard care - the interruptions are for ventilations to oxygenate the blood, prior to resuming chest compressions to keep it circulating. The issue was that the result of the trial was 0.7% better survival in the interrupted-compression group, with 95% CI from -1.5% to 0.1%. So the data are suggesting a probable benefit to interrupted compressions. However, on Twitter the NEJM announced this as “no difference”, no doubt because the difference was not “statistically significant”. So I wrote pointing out that this wasn’t a good interpretation, and the dichotomy into “significant” and “non-significant” is pretty unhelpful in situations where the results are close to “significance”. Bayesian methods have a huge advantage here, in that they can actually quantify the probability of benefit. An 80% probability that the treatment is beneficial is a lot more useful than “non-significance”, and might lead to very different actions.

The letter was published along with a very brief reply from the authors (they were probably constrained, as I was in the original letter, by a tiny word limit): “Bayesian analyses of trials are said to offer some advantages over traditional frequentist analyses. A limitation of the former is that different people have different prior beliefs about the effect of treatment. Alternative interpretations of our results offered by others show that there was not widespread clinical consensus on these prior beliefs. We are not responsible for how the trial results were interpreted on Twitter.”

Taking the last point first: no, the authors did not write the Twitter post. But they also did not object to it. I'm not accusing them of making the error that non-significance = no difference, but it is so common that it usually - as here - passes without comment. But it's just wrong.

Their initial point about priors illustrates a common view, that Bayesian analysis is about incorporating individual prior beliefs into the analysis. While you can do this, it is neither necessary nor a primary aim. As Andrew Gelman has said (and I have repeated before): prior information, not prior beliefs. We want to base a prior on the information that we have at the start of the trial, and if that is no information, then that’s fine. However, we almost always do have some information on what the treatment effect might plausibly be. For example, it’s very unusual to find an odds ratio of 10 in any trial, so an appropriate prior would make effects of this (implausible) size unlikely. More importantly, in this case, getting too hung up on priors is a bit irrelevant, because the trial was so huge (over 20,000 participants) that the data will completely swamp any reasonable prior.

It isn’t possible to re-create the analysis from the information in the paper, as it was a cluster-randomised trial with crossover, which needs to be taken into account. Just using the outcome data for survival to discharge in a quick and dirty Bayesian analysis, though, gives a 95% credible interval of something like 0.84 to 1.00, with a probability of about 98% that the odds ratio is less than 1. That probably isn’t too far away from the correct result, and suggests pretty strongly that survival may be a bit worse in the continuous-compression group. “No difference” just doesn’t seem like an adequate summary to me.
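
For anyone curious what "quick and dirty" might look like, here is a conjugate beta-binomial sketch with flat priors. The event counts below are illustrative, chosen to give survival proportions close to those reported (roughly 9.0% vs 9.7% in about 12,600 per arm); they are not the actual trial data, and the cluster-crossover design is ignored, as it was above.

```python
# Conjugate beta-binomial sketch of a "quick and dirty" Bayesian analysis.
# NOTE: the event counts are ILLUSTRATIVE, chosen to give survival
# proportions close to those reported (~9.0% vs ~9.7% in roughly 12,600
# per arm); they are not the actual trial data, and the cluster-crossover
# design is ignored.
import numpy as np

rng = np.random.default_rng(42)

surv_cont, n_cont = 1134, 12600   # continuous compressions (illustrative)
surv_int, n_int = 1222, 12600     # interrupted compressions (illustrative)

draws = 100_000
# Posterior under a flat Beta(1, 1) prior: Beta(events + 1, non-events + 1)
p_cont = rng.beta(surv_cont + 1, n_cont - surv_cont + 1, draws)
p_int = rng.beta(surv_int + 1, n_int - surv_int + 1, draws)

odds_ratio = (p_cont / (1 - p_cont)) / (p_int / (1 - p_int))

ci_low, ci_high = np.percentile(odds_ratio, [2.5, 97.5])
prob_or_below_1 = (odds_ratio < 1).mean()
print(f"95% credible interval for OR: {ci_low:.2f} to {ci_high:.2f}")
print(f"P(OR < 1) = {prob_or_below_1:.3f}")
```

With a trial this size the flat prior is immaterial; any reasonable informative prior gives essentially the same posterior.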

My letter and the authors’ reply are here: http://www.nejm.org/doi/full/10.1056/NEJMc1600144

The original trial report is here: Nichol G, Leroux B, Wang H, et al. Trial of continuous or interrupted chest compressions during CPR. N Engl J Med 2015;373:2203-2214 http://www.nejm.org/doi/full/10.1056/NEJMoa1509139

December 09, 2015

Why do they say that?

A thing I've heard several times is that Bayesian methods might be advantageous for Phase 2 trials but not for Phase 3. I've struggled to understand why people would think that. To me, the advantage of Bayesian methods comes in the fact that the methods make sense, answer relevant questions and give understandable answers, which seem just as important in Phase 3 trials as in Phase 2.

One of my colleagues gave me his explanation, which I will paraphrase. He made two points:

1. Decision-making processes are different after Phase 2 and Phase 3 trials; following Phase 2, decisions about whether to proceed further are made by researchers or research funders, but after Phase 3, decisions (about use of therapies, presumably) are taken by "society" in the form of regulators or healthcare providers. This makes the Bayesian approach harder, as it is harder to formulate a sensible prior (for Phase 3, I think he means).

2. In Phase 3 trials sample sizes are larger so the prior is almost always swamped by the data, so Bayesian methods don't add anything.

My answer to point 1: Bayesian methods are about more than priors. I think this criticism comes from the (limited and in my view somewhat misguided) view of priors as a personal belief. That is one way of specifying them but not the most useful way. As Andrew Gelman has said, prior INFORMATION not prior BELIEF. And you can probably specify information in pretty much the same way for both Phase 2 and Phase 3 trials.

My answer to point 2: Bayesian methods aren't just about including prior information in the analysis (though they are great for doing that if you want to). I'll reiterate my reasons for preferring them that I gave earlier - the methods make sense, answer relevant questions and give understandable answers. Why would you want to use a method that doesn't answer the question and nobody understands? Also, if you DO have good prior information, you can reach an answer more quickly by incorporating that in the analysis - which we kind of do by doing trials and then combining them with others in meta-analyses; but doing it the Bayesian way would be neater and more efficient.

September 18, 2015

Even heroes get it wrong sometimes

I recently read David Sackett's 2004 paper from Evidence-based Medicine, “Superiority trials, non-inferiority trials, and prisoners of the 2-sided null hypothesis” (Evid Based Med 2004;9:38-39 doi:10.1136/ebm.9.2.38). [links don’t seem to be working, will edit later if I can].

In it I found this:

“As it happened, our 1-sided analysis revealed that the probability that our nurse practitioners’ patients were worse off (by ⩾5%) than our general practitioners’ patients was as small as 0.008.”

I’m pretty sure that 0.008 probability isn’t from a Bayesian analysis and is a misinterpretation of a p-value. It isn’t the probability of the null hypothesis being true! It really isn’t! Obviously that got past the reviewers of this manuscript without comment.

Edit: I've got the paper now. It's a result from a one-tailed test for non-inferiority. The null hypothesis is that the intervention group was worse by 5% or more on their measure of function, p=0.008 so they reject the hypothesis of inferiority. But, as usual, that's the probability of getting the data (or more extreme data) if the null hypothesis is true - not the probability of the null hypothesis.

May 02, 2015

New test can predict cancer. Oh no it can't!

A story in several UK papers including the Telegraph suggests that a test measuring telomere length can predict who will develop cancer "up to 13 years" before it appears. Some of the re-postings have (seemingly by a process of Chinese whispers) elaborated this into "A test that can predict with 100 per cent accuracy whether someone will develop cancer up to 13 years in the future has been devised by scientists" (New Zealand Herald) - which sounds pretty unlikely.

What they are talking about is this study, which analysed telomere lengths in a cohort of people, some of whom developed cancer.

It's hard to know where to start with this. There are two levels of nonsense going on here: the media hype, which has very little to do with the results of the study, and the study itself, which seems to come to conclusions that are way beyond what the data suggest, through a combination of over-reliance on significance testing, poor methodology and wishful thinking. I'll leave the media hype to one side, as it's well established that reporting of studies often bears little relation to what the study actually did; in this case, there was no "test" and no "100% accuracy". But what about what the researchers really found out, or thought they did?

The paper makes two major claims:

1. "Age-related BTL attrition was faster in cancer cases pre-diagnosis than in cancer-free participants" (that's verbatim from their abstract);

2. "all participants had similar age-adjusted BTL 8–14 years pre-diagnosis, followed by decelerated attrition in cancer cases resulting in longer BTL three and four years pre-diagnosis" (also verbatim from their abstract, edited to remove p-values).

They studied a cohort of 579 initially cancer-free US veterans who were followed up annually between 1999 and 2012, with blood being taken 1-4 times from each participant. About half had only one or two blood samples, so there isn't much in the way of within-patient comparisons of telomere length over time. Telomere length was measured from these blood samples (this was some kind of average, but I'll assume intra-individual variation isn't important).

Figure 1 illustrates the first result:


The regression lines do look as though there is a steeper slope through the cancer group, and the interaction is "significant" (p=0.032 unadjusted and p=0.017 adjusted) - but what is ignored in the interpretation is the enormous scatter around both of the regression lines. Without the lines on the graph you wouldn't be able to tell whether there was any difference in the slopes. Additionally, as relatively few participants had multiple readings, it isn't possible to do an analysis comparing within-patient changes in telomere length, which might be less noisy. Instead we have an analysis of average telomere length at each age, with a changing set of patients. So, on this evidence, it is hard to imagine how this could ever be a useful test that would be any good for distinguishing people who will develop cancer from those who will not. The claim of a difference seems to come entirely from the "statistical significance" of the interaction.

The second claim, that in people who develop cancer BTL stops declining and reaches a plateau 3-4 years pre-diagnosis, derives from their Figure 2:


Again, the claim derives from the difference between the two lines being "statistically significant" at 3-5 years pre-diagnosis, and not elsewhere. But looking at the red line, it really doesn't look like a steady decline, followed by a plateau in the last few years. If anything, the telomere length is high in the last few years, and the "significance" is caused by particularly low values in the cancer-free group in those years. I'm not sure that this plot is showing what they think it shows; the x-axis for the cancer group is years pre-diagnosis, but for the non-cancer group it is years pre-censoring, so it seems likely that the non-cancer group will be older at each point on the x axis. Diagnoses of cancer could happen at any time, whereas most censoring is likely to happen at or near the end of the study. If BTL declines with age, that could potentially produce this sort of effect. So I'm pretty unconvinced. The claim seems to result from looking primarily at "statistical significance" of comparisons at each time point, which seems to have trumped any sense-checking.

April 09, 2015

Journal statistical instructions – is that it??

Writing about web page http://www.resuscitationjournal.com/content/authorinfo

I submitted a manuscript to the journal Resuscitation recently. It's a pretty well-regarded medical journal, with an impact factor (for 2013) of 3.96, so a publication there would be a good solid paper. While formatting the manuscript I had a look at the statistical section of the Instructions for Authors. This is what I found:

"Statistical Methods
* Use nonparametric methods to compare groups when the distribution of the dependent variable is not normal.
* Use measures of uncertainty (e.g. confidence intervals) consistently.
* Report two-sided P values except when one-sided tests are required by study design (e.g., non-inferiority trials). Report P values larger than 0.01 to two decimal places, those between 0.01 and 0.001 to three decimal places; report P values smaller than 0.001 as P<0.001."

That's it! 69 words (including the title), more than half of which (43) are about reporting of p-values. I really don't think that many people would find this very useful (for example, what does "use measures of uncertainty consistently" mean?). Moreover, it seems to start from the premise that statistical analysis IS null hypothesis significance testing, and there are lots of reasons to take issue with that point of view. And finally (for now), it is questionable whether two-sided tests are usually the right thing to do, as we are usually interested in whether a treatment is better than another, not in whether it is different (not caring whether it is better or worse) - I won't go further into that now, but suffice it to say it is a live issue.

January 25, 2015

"Classical" statistics

There is a tendency to describe traditional frequentist methods as "classical" statistics, often making a contrast with modern Bayesian methods, which are (or at least appear in their modern guise) much newer and a break with tradition. That's kind of fair enough, but I don't like the term classical being applied to traditional statistics, for two main reasons.

1. "Classical" is already in use for describing other types of thing (music, literature, architecture) and it has connotations of quality that aren't really applicable to statistics. These are classical:



2. It's inaccurate. Frequentist statistics dates from the first half of the 20th century. Bayesian statistics goes back much further, to Laplace (early 19th century) and Bayes (18th century) - so if anything should be called "classical", it is Bayesian methods.

December 23, 2014

The Cochrane diamond

You know, the one at the bottom of your meta-analysis that summarises the pooled result? This one:


Well, I don't like it. Why not? I think it's misleading, because the diamond shape (to me at least) suggests it is representing a probability distribution. It puts you in mind of something like this:

And that seems to make sense - the thick bit of the diamond, where your point estimate is, ought to be the area where the (unknown) true treatment effect would be most likely to be, and the thin points of the diamond are like the tails of the distribution, where the probability of the true value is getting smaller and smaller. That would be absolutely right, if the analysis was giving you a Bayesian credible interval - but it isn't.

It's a frequentist confidence interval, and as lots of people have been showing recently, frequentist confidence intervals do not represent probability distributions. They are just an interval constructed by an algorithm so that, if the experiment were repeated many times, 95% of the intervals would include the true value. They are NOT a distribution of the probability of any value of the treatment effect, conditional on the data, although that is the way they are almost always interpreted. They don't say anything about the probability of the location of the true value, or even whether it is inside or outside any particular interval.

I think a solid bar would be a more reasonable way to represent the 95% confidence interval.
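
The "repeated experiments" property is easy to check by simulation. A minimal sketch, assuming normally distributed data with known standard deviation (so the interval is simply mean ± 1.96·σ/√n):

```python
# Simulation showing what a 95% confidence interval actually guarantees:
# long-run coverage of the procedure, not a probability distribution for
# the parameter. Assumes normal data with KNOWN sigma, so the interval
# is simply mean +/- 1.96 * sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(1)
true_mean, sigma, n, reps = 0.0, 1.0, 30, 10_000

samples = rng.normal(true_mean, sigma, size=(reps, n))
means = samples.mean(axis=1)
half_width = 1.96 * sigma / np.sqrt(n)

# Each interval either contains the true value or it doesn't; about 95%
# of them do, and that is the whole content of the "95%"
coverage = ((means - half_width <= true_mean) &
            (true_mean <= means + half_width)).mean()
print(f"coverage = {coverage:.3f}")
```

None of this says anything about where the true value probably sits within any one interval, which is exactly the distinction the diamond shape obscures.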

For more info:

Hoekstra R, Morey RD, Rouder JN, Wagenmakers EJ. Robust misinterpretation of confidence intervals. Psychon Bull Rev. 2014. doi:10.3758/s13423-013-0572-3

August 28, 2014

Treatment success in HTA trials: thoughts on Dent & Raftery 2011

Dent & Raftery analysed how many trials funded by HTA showed treatment benefit, harm, or were inconclusive (Trials 2011; 12:109). They found that 24% of trials showed a "significant" result (19% in favour of the new intervention and 5% in favour of control). How many of these results are likely to be correct?

Power and the proportion of interventions that are really effective determine the number of significant results that we see. If all interventions are effective, and power of all the trials is 90%, then 90% of trials will give a "significant" result. If no interventions are effective then we expect 5% of results to be significant; these are all false positives. See the graph below.

Proportion of truly effective trials against proportion found significant, by power

Proportion of truly effective trials in a field (i.e. the “population” of interventions that could be tested) against the proportion of “significant” effects that will be found, for different values of power. Except when very few interventions are truly effective, the proportion of truly effective interventions exceeds the proportion found to be significant.

So reading across from 24% significant results: in the population of interventions there could be from about 25% (if power is 90%) to 100% (if power is less than about 25%) effective interventions. Power probably isn't 90%; most trials are designed to have 80-90% power, usually for an optimistic effect size, so given that many effect sizes are smaller than anticipated and many trials recruit fewer than expected, power might typically be 60-70%. This suggests that around 30% of the interventions evaluated by HTA trials are truly effective.

How often do significance tests get it right? If power is 70% and about 30% of interventions are truly effective, roughly 87% of significant differences reflect truly effective interventions, meaning that about 13% are false positives. As power decreases, the proportion of significant differences that are really effective decreases - the positive predictive value of a significant result gets worse.


Proportion of significant effects that are truly effective (PPV) for different values of power.

If a low proportion are really effective, a lot of significant effects will be false positives. Low power also makes this worse.
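
The arithmetic behind these figures is simple Bayes' theorem; here is a sketch (the exact percentages depend on the assumed proportion of truly effective interventions and on alpha):

```python
# Sketch of the arithmetic above. Assumptions: a proportion `prev` of the
# interventions tested are truly effective, all trials have the stated
# power, and the false positive rate is alpha = 0.05.

def prop_significant(prev, power, alpha=0.05):
    """Expected proportion of trials giving a 'significant' result."""
    return prev * power + (1 - prev) * alpha

def ppv(prev, power, alpha=0.05):
    """Proportion of 'significant' results that are truly effective (PPV)."""
    return prev * power / prop_significant(prev, power, alpha)

# Reading back from 24% significant results with power ~65% gives about
# 30% truly effective, as in the text:
prev = (0.24 - 0.05) / (0.65 - 0.05)
print(round(prev, 2))  # about 0.32

# PPV at 70% power and ~30% truly effective - roughly 86%:
print(round(ppv(0.30, 0.70), 3))

# Lower power drags the PPV down:
for power in (0.9, 0.7, 0.5, 0.3):
    print(power, round(ppv(0.30, power), 3))
```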

July 17, 2014

The EAGeR trial: Preconception low–dose aspirin and pregnancy outcomes

Lancet Volume 384, Issue 9937, 5–11 July 2014, Pages 29–36

Some extracts from the abstract:
Overall, 1228 women were recruited and randomly assigned between June 15, 2007, and July 15, 2011, 1078 of whom completed the trial and were included in the analysis.
309 (58%) women in the low-dose aspirin group had livebirths, compared with 286 (53%) in the placebo group (p=0·0984; absolute difference in livebirth rate 5·09% [95% CI −0·84 to 11·02]).
Preconception-initiated low-dose aspirin was not significantly associated with livebirth or pregnancy loss in women with one to two previous losses. … Low-dose aspirin is not recommended for the prevention of pregnancy loss.
So the interpretation is of a so-called "negative" trial, i.e. one that did not show any evidence of effectiveness.
BUT... the original planned sample size was 1600, with 1254 included in analyses (the other 346 being the 20% allowance for loss to follow up), which was calculated to have 80% probability of a "significant" result if there was in reality a 10% increase in live births in the intervention group from 75% in the control group.
In fact the trial recruited 1228 and lost 12.2%, so only 1078 were included in the analyses (86% of the target). The placebo group incidence was different from expectation (53% compared with 75%) and the treatment effect was about half of what the sample size was calculated on (absolute difference of 5% rather than 10%), though they were more similar expressed as risk ratios than risk differences (1.09 compared with 1.13). Nevertheless the treatment effect was quite a bit smaller than the effect the trial was set up to find.
So is concluding ineffectiveness here reasonable? A 5% improvement in livebirth rate could well be important to parents, and it is not at all clear that the 10% difference originally specified represents a "minimum clinically important difference". So the trial could easily have missed a potentially important benefit. This isn't addressed anywhere in the paper. The conclusions seem to be based mainly on the "non-significant" result (p=0.09), without any consideration of what the trial could realistically have detected.
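
A rough check makes the point. The sketch below asks what power the trial had to detect the 5% difference it actually observed, using a normal-approximation power calculation for a two-sided two-proportion test. The 53% control rate comes from the abstract; roughly 539 per group (1078 analysed, assumed split evenly) is my assumption, as the exact arm sizes aren't given here.

```python
# Approximate power of the trial to detect the 5% absolute difference it
# actually observed (53% vs 58% livebirth rates). ASSUMPTION: ~539
# analysed per group (1078 total, split evenly); two-sided alpha = 0.05.
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)   # SE under H0
    se_alt = sqrt(p1 * (1 - p1) / n_per_group +
                  p2 * (1 - p2) / n_per_group)              # SE under H1
    z = (abs(p2 - p1) - z_alpha * se_null) / se_alt
    return NormalDist().cdf(z)

# Power to detect 53% vs 58% with ~539 per group: roughly 0.4, far short
# of the 80% the trial was designed with (for the larger 10% difference)
print(round(power_two_proportions(0.53, 0.58, 539), 2))
```

On these assumptions the trial had well under 50% power for the difference it actually saw, which is why "non-significant" here tells us so little.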
