Disclaimer: The tone of this post may have been affected by the results of the British EU referendum.
There has been considerable chat and Twittering about the “fragility index” so I thought I’d take a look. The basic idea is this: researchers get excited about “statistically significant” (p<0.05) results, the standard belief being that if you’ve found “significance” then you have found a real effect. [this is of course wrong, for lots of reasons] But some “significant” results are more reliable than others. For example, if you have a small number of events in your trial, it would only require a few patients to have had different outcomes to tip a “significant” result into “non-significance”. So it would be useful to have a measure of the robustness of statistically significant results, so that readers will get a sense of how reliable they are. The Fragility Index (FI) aims to provide this. It is calculated as the number of patients that would have had to have had different outcomes in order to render the result “non-significant” (p > 0.05). So if a trial had 5/100 with the main outcome in one group and 18/100 in the other, the p-value would be 0.007 (pretty significant, huh?). The fragility index would be 3 (according to the handy online calculator www.fragilityindex.com, which will calculate your p-value to 15 decimal places): only three of the intervention group non-events would need to have been events to raise the p-value above 0.05.
There’s a paper introducing this idea, from 2014:
Walsh M et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. J Clin Epidemiol. 2014 Jun;67(6):622-8. doi: 0.1016/j.jclinepi.2013.10.019. Epub 2014 Feb 5.
I think there are good and bad aspects to this. On the positive side, it’s good that people are thinking about the reliability of “significant” results and acknowledging that just achieving significance doesn’t mean that you’ve found anything important. But to me the Fragility Index doesn’t get you much further forward. If you find a low Fragility Index, what do you do with that information? We have always known that significance when there are few events is unreliable. The problem is really judging that there is a qualitative difference between results that are “significant” and “non-significant”, a zombie myth that the Fragility Index doesn’t do anything to dispel. The justification is that judging results by “significance” is an ingrained habit that isn’t going to go away in a hurry, so the FI will highlight unreliable results and help people to avoid mistakes in interpretation. I have some sympathy with that view, but really, the problem is with the use of significance testing, and we should be promoting things that will help us to move away from this, rather than introducing new procedures that seem to validate it.
There are some things in the paper that I really didn’t like, for example: “The concept of a threshold P-value to determine statistical significance aids our interpretation of trial results.” Really? How exactly does it do that? It just creates an artificial dichotomy based on a nonsensical criterion. The paper tries to explain in the next sentence: “It allows us to distill the complexities of probability theory into a threshold value that informs whether a true difference likely exists”. I have no idea what the first part of that means, but the second part is just dead wrong. No p-value will ever tell you “whether a true difference likely exists” because they are calculated on the assumption that the difference is zero. This is just perpetuating one of the common and disastrous misinterpretations of p-values, and it is pretty surprising that this set of authors gets it wrong. Or maybe it isn’t, considering that almost everyone else does.