April 07, 2014

David Colquhoun agrees with me!

On the hazards of significance testing. Part 2: the false discovery rate, or how not to make a fool of yourself with P values


makes much the same points as I have made elsewhere in this blog, though he doesn't go as far as recommending Bayesian analyses. But I can't see how you can sensibly interpret p-values without a prior, and if you're going to go that far, a fully Bayesian analysis is the natural thing to do, surely?

February 20, 2014

Australian homeopathy review – surely they are kidding?

The strategic plan of the NHMRC in Australia identified "‘examining alternative therapy claims’ as a major health issue for consideration by the organisation, including the provision of research funding" (http://www.nhmrc.gov.au/your-health/complementary-and-alternative-medicines). Well, that seems OK; they include a wide range of treatments under "complementary and alternative therapies", from the relatively mainstream (meditation and relaxation therapies, osteopathy) to the completely bonkers (homeopathy, reflexology), so it is reasonable to investigate the effectiveness of some of these.

But hold on! Further down the page we find a "Homeopathy review", and NHMRC have convened a "Homeopathy Working Committee" to oversee this. The plan seems to be to conduct an overview of systematic reviews on the effectiveness of homeopathy, to produce an information paper and position statement. The Working Committee includes some eminent names, and one member who, as a teacher of homeopathy, has a clear conflict of interest. I suppose you can argue that it is important to have a content expert in a review team, but in this case, where someone cannot help but have a personal interest in one particular outcome, it doesn't seem right. Like asking a committed Christian to weigh up dispassionately the evidence for the existence of god(s); unlikely to work.

I am somewhat staggered that this review is going ahead as it can only come to one credible conclusion, and I am struggling to understand the NHMRC's motivation. Did the homeopathy lobby push for this as part of its effort to be seen as evidence based and mainstream? Or did the NHMRC think that this was the best way to put homeopathy to bed for good? If the latter, I doubt it will be successful, as there will always be odd "statistically significant" results from trials of homeopathy, caused by bias or chance, that will keep the possibility of effectiveness alive in the minds of the credulous.

I have contacted the Homeopathy Working Committee to encourage them to use Bayesian methods with an appropriate prior!

January 24, 2014


...published a couple of years ago in American Journal of Respiratory and Critical Care Medicine (Wunderink et al Am J Resp Crit Care Med 2011; 183(11): 1561-1568). It's a trial of recombinant tissue factor pathway inhibitor (tifacogin) for patients with severe community acquired pneumonia, and randomised people to tifacogin 0.025 mg/kg/h, 0.075 mg/kg/h, or placebo. The rationale for it was that tifacogin seemed to be beneficial in the subgroup with severe community acquired pneumonia in a previous trial of patients with sepsis (which rings alarm bells with me, but that's another issue). The trial was international, involving 188 centres, and randomised 238 patients, so a major undertaking.

The interesting point about it was that they performed an interim analysis, as a result of which they stopped randomisation to the higher dose of drug due to lack of efficacy (futility) but continued to randomise to the lower dose. This seems extraordinary; if the high dose isn't doing anything, it seems pretty unlikely that the low dose would. I could understand it if the high dose was stopped because of toxicity or an increase in adverse outcomes, like death, but that doesn't seem to have been the case.

Unsurprisingly, the final trial results showed no difference in mortality between tifacogin (18%) and placebo (17.9%). Has there ever been a case where a promising-looking subgroup result was shown in a subsequent trial to be correct?

December 03, 2013

"Significance testing" and prior probabilities

I recently came across a helpful account of an issue that has been bothering me: the interpretation of significance tests. It was in a slightly unexpected place – the GraphPad software online statistics guide:


The issue is about how you interpret a “significant” p-value. Say you compare a drug to placebo to see if it cures people, and you get a “significant” effect (p < 0.05). Does that mean the drug works? Not necessarily. Apart from the obvious 5% of occasions when you will get a “significant” effect when the drug does nothing, it also depends on the prior probability that a drug is effective. It’s exactly the same issue as with diagnostic tests, where the prevalence of a disease has a huge effect on the positive predictive value of a test. If a disease is very rare, even a test with extremely high sensitivity and specificity can be essentially useless, because almost all of the positives will be false positives.

So it is with trials. If your trial has 80% power, and a 5% Type I error rate, then if the prior probability of a drug being effective is 80% then in 1000 replicates of the experiment you will get:

Prior probability = 80%

                             Drug really works   Drug really doesn't work   Total
P<0.05, “significant”               640                     10               650
P>0.05, “not significant”           160                    190               350
Total                               800                    200              1000

So on 640/650 (98.46%) of the occasions where you get a “significant” result, the drug will really be effective. [It would also be effective in nearly half of the experiments with a “non-significant” result (160/350).]

However, if there is only a 10% chance that the drug really works, things look a lot worse.

Prior probability = 10%

                             Drug really works   Drug really doesn't work   Total
P<0.05, “significant”                80                     45               125
P>0.05, “not significant”            20                    855               875
Total                               100                    900              1000

Now the drug is only really effective in 64% of trials with a “significant” result. With 1% prior probability of the drug’s effectiveness, it really works in only 14% of trials with “significant” results.

So the prior probability of the treatment’s effectiveness is absolutely crucial in interpretation of the results of trials. But I don’t think I have ever seen this mentioned in the results or discussion of a paper. I'm really not sure how you would go about downgrading your confidence in a frequentist result based on the prior probability; there isn't a mechanism for doing this. But this is undoubtedly a major cause of misinterpretation of trial results. When you consider that most trials have pretty low power (maybe 50-60% at best) to detect realistic treatment effects, and that the majority of interventions that are tested probably don't work (maybe at best 20% are effective?), then the false positive rate is going to be substantial.
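The arithmetic behind the two tables, and the low-power scenario just described, can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and it assumes a simple two-state world in which the drug either works (and is detected with probability equal to the power) or does nothing (and gives a "significant" result with probability equal to the Type I error rate).

```python
def prob_effective_given_significant(prior, power=0.80, alpha=0.05):
    """Probability that the drug really works, given a "significant" result.

    prior: prior probability that the drug is effective
    power: probability of a significant result if the drug works
    alpha: Type I error rate (significant result when the drug does nothing)
    """
    true_positives = prior * power          # drug works and p < 0.05
    false_positives = (1 - prior) * alpha   # drug does nothing but p < 0.05
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # The three priors discussed in the post, with 80% power
    for prior in (0.80, 0.10, 0.01):
        print(prior, round(prob_effective_given_significant(prior), 4))
    # A guess at a more typical situation: 50% power, 20% of interventions work
    print(round(prob_effective_given_significant(0.20, power=0.50), 4))
```

With the illustrative guesses of 50% power and a 20% prior, roughly 7 in 10 "significant" results would reflect a real effect, so about 3 in 10 would be false positives.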

This is another way in which Bayesian methods score over standard traditional analyses; they force us to consider the prior probabilities of hypotheses, and to include them explicitly in the analysis. The issue seems always to be swept under the carpet in traditional analyses, with potentially disastrous consequences. Actually, saying it is swept under the carpet is probably inaccurate - most people are completely unaware that this is even an issue.

September 26, 2013

Asking the wrong question

A study proposal came across my desk recently about evaluation of a new test for infection in a certain group of patients. The potential benefit was that the test uses a chemical marker of infection that is thought to increase rapidly early in the infection process, so it would potentially allow earlier diagnosis and treatment of the infection.

However, the analysis proposed to look for differences in the levels of the marker between patients with confirmed infection and those without. This is asking the wrong question: the issue is not whether the levels of marker differ between infected and non-infected patients. If this is being proposed as a test that will identify infected patients, presumably there is already a pretty good idea that levels of the marker differ. The important issue here is whether the marker is good at identifying those patients that have real infections, i.e. it is a diagnostic question of sensitivity and specificity. The most important number is probably the positive predictive value: if a lot of the patients with a positive test result turn out not to be infected, it isn’t going to be much use in clinical practice.
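To make the diagnostic framing concrete, here is a minimal sketch of the quantities the proposal should have been estimating. The function name and the counts are invented purely for illustration; they are not from the study proposal.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Standard diagnostic accuracy measures from a 2x2 table of counts.

    tp: infected, test positive      fp: not infected, test positive
    fn: infected, test negative      tn: not infected, test negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # infected patients correctly identified
        "specificity": tn / (fp + tn),  # non-infected patients correctly identified
        "ppv": tp / (tp + fp),          # positive results that are real infections
        "npv": tn / (fn + tn),          # negative results that are truly uninfected
    }


if __name__ == "__main__":
    # Hypothetical counts: 50 infected and 950 non-infected patients,
    # with a test that has 90% sensitivity and 90% specificity.
    for name, value in diagnostic_accuracy(tp=45, fp=95, fn=5, tn=855).items():
        print(f"{name}: {value:.3f}")
```

In this made-up example the marker levels would certainly differ between the groups, yet only about a third of positive results would be real infections, which is the point about asking the right question.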

A similar situation arose in a systematic review we did a few years ago of risk factors for chronic disability after whiplash injury. In this, a number of studies had recorded risk factors of whiplash-injured patients (such as injury severity, pre-existing pain, and so on) and whether they developed long-term problems, then analysed whether the risk factors differed between the patients who had recovered and those who had ongoing problems. Again, this is not addressing the right question. What we want to know is how good are risk factors that a clinician can assess early on at predicting long-term whiplash-associated problems.

September 06, 2013

Missing data in systematic reviews – an unappreciated problem?

Most systematic reviews state in their results sections and abstracts how many studies they included. But you usually find that not all outcomes are reported by all studies; it's quite common for important outcomes to be reported by only a minority of studies. What is usually done in this situation is essentially nothing; the subset of studies that have data is used to calculate the estimated treatment effects and this is presented as the review's result.

For example, in the Cochrane review of "Interventions for preventing falls in older people living in the community" [1], there were 40 studies that evaluated multifactorial interventions (these are interventions that consist of several components, for example exercises for strength or balance, medication review, vision assessment, home hazard assessment etc.; patients are assessed to find out what risk factors for falling they have and specific interventions for these are then provided). The review looked at the number of fallers as one outcome, and also, more importantly, the number of participants sustaining fractures. The meta-analysis of the number of fallers included 34 studies, so only six did not provide data on this outcome. However, the meta-analysis of fractures included only 11 studies (27.5% of the studies included in the review), so the conclusion about fractures is based on an analysis in which most of the data are missing. Obviously, this outcome exists for all studies that were conducted; the participants either had a fracture or didn't during the follow-up period, but we only know how many did and didn't for 11 trials. For the other 29, the information is missing.

The big problem here is the risk of introducing bias. When conducting trials and considering them for inclusion in a systematic review, incomplete outcome data are one of the criteria for judging risk of bias. A common rule of thumb is that more than 20% missing data can put a study at high risk of bias (though obviously that is simplistic, and its origin is obscure). More than 50% of data missing would be very worrying and you would not expect to put much credence on the results. So surely in a situation like the falls review, with 72.5% of studies missing, we should have major reservations about the estimated treatment effect? Yet treatment effects estimated from a subset of studies are routinely presented on an equal footing with results with small amounts of missing data. This doesn't seem right.

If there is an important outcome (like death) that is only reported by a few studies, and there happens to be a difference in those studies, that is likely to be prominently featured in the review's results and conclusions. But the participants in all of the other trials either died or didn't die; the results for these trials exist but weren't reported. It is quite possible that if they were known they would completely negate the positive effects in the trials that reported death. Maybe the reason those few trials reported it was precisely because of the treatment benefit?

[1] Gillespie LD, Robertson MC, Gillespie WJ, Sherrington C, Gates S, Clemson LM, Lamb SE. Interventions for preventing falls in older people living in the community. Cochrane Database of Systematic Reviews 2012, Issue 9. Art. No.: CD007146. DOI: 10.1002/14651858.CD007146.pub3.

June 17, 2013

Sample size and the Minimum Clinically Important Difference

Performing a sample size calculation has become part of the rigmarole of randomized trials and is now expected as a sign of “quality”. For example, the CONSORT guidelines include reporting of a sample size calculation as one of the items that should be included in a trial report, and many quality scales and checklists include presence of a sample size calculation as one of the quality markers. Whether any of this is right or just folklore is an interesting issue that receives little attention. [I’m intending to come back to this issue in future posts]

For now I want to focus on one aspect of sample size calculations that seems to me not to make much sense.

In the usual idealized sample size calculation, you assume a treatment effect that you want to be able to detect. Ideally this should be the “minimum clinically important difference” (MCID); the smallest difference that it would be worthwhile to know about, or the smallest difference that would lead to one treatment being favoured over the other in clinical practice. Obviously this is not an easy thing to calculate, but leaving practical issues to one side for the moment, in an ideal situation you would have a good idea of the MCID. Having established the MCID, this is used as the treatment effect in a standard sample size calculation, based on a significance test (almost always at the 5% significance level) and a specified level of power (almost invariably 80% or 90%). This gives a number of patients that need to be recruited. This number will give a “statistically significant” difference the specified percentage of the time (power) if the true difference is the MCID.

The problem here is that the sample size calculation is based on finding a statistically significant result, not on demonstrating that the difference is larger than a certain size. But if you have identified a minimum clinically important difference, what you want to be able to say with a high degree of confidence is whether the treatment effect exceeds it. However, the standard sample size calculation is based on statistical significance, which is equivalent to finding that the difference is non-zero. If the true difference is the MCID, the confidence limit nearest to no effect is likely to be close to zero, and the interval will only rarely lie far enough from zero to exclude the MCID. Hence the standard sample size may have adequate power to show whether there is a non-zero difference, but has very little power to show that the difference exceeds the MCID. Hence most results will be inconclusive; they will show that there is evidence of benefit, but leave uncertainty about whether it is large enough to be clinically important.

As an example, imagine the MCID is thought to be a risk ratio of 0.75 (a bad outcome occurs in 40% of the control group and 30% of the intervention group). A standard sample size calculation gives 350 participants per group. So you do the trial and (unusually!) the proportions are exactly as expected: 40% in the control and 30% in the intervention group. The calculated risk ratio is 0.75 but the 95% confidence interval around this is 0.61 to 0.92. So you can conclude that the treatment has a non-zero effect but you don’t know whether it exceeds the minimum clinically important difference. With this result you would only have a 50% chance that the real treatment effect exceeded the MCID.
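The numbers in this example can be reproduced with a short sketch. The function names are invented for illustration, and the formulas are my assumptions about what a "standard" calculation means here: a normal-approximation sample size formula for two proportions (80% power, two-sided 5% significance), a Wald interval on the log risk ratio, and a flat prior on the log risk ratio for the final probability. Other formulas give slightly different sample sizes.

```python
from math import ceil, exp, log, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p_control, p_treat, alpha=0.05, power=0.80):
    """Participants per group, unpooled normal-approximation formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treat) ** 2)


def log_rr_standard_error(p_control, p_treat, n_per_group):
    """Standard error of the log risk ratio with equal group sizes."""
    return sqrt((1 - p_treat) / (p_treat * n_per_group)
                + (1 - p_control) / (p_control * n_per_group))


def risk_ratio_ci(p_control, p_treat, n_per_group, level=0.95):
    """Risk ratio with a Wald confidence interval on the log scale."""
    rr = p_treat / p_control
    se = log_rr_standard_error(p_control, p_treat, n_per_group)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)


def prob_effect_exceeds_mcid(p_control, p_treat, n_per_group, mcid_rr):
    """P(true risk ratio < MCID), assuming a flat prior on the log risk ratio."""
    se = log_rr_standard_error(p_control, p_treat, n_per_group)
    posterior = NormalDist(mu=log(p_treat / p_control), sigma=se)
    return posterior.cdf(log(mcid_rr))


if __name__ == "__main__":
    print(sample_size_two_proportions(0.40, 0.30))     # roughly 350 per group
    print(risk_ratio_ci(0.40, 0.30, 350))              # about 0.75 (0.61 to 0.92)
    print(prob_effect_exceeds_mcid(0.40, 0.30, 350, 0.75))  # about 0.5
```

The last function is where the 50% comes from under these assumptions: the estimate sits exactly on the MCID, so half of the uncertainty lies on either side of it.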

So sizing a trial based on the MCID might seem like a good idea, but in fact if you use the conventional methods, the result is probably not going to give you much information about whether the treatment effect really is bigger than the MCID or not. I suspect that in most cases the excitement of a “statistically significant” result overrides any considerations of the strength of the evidence that the effect size is clinically useful.

Randomised trial of the LUCAS mechanical chest compression device

Follow-up to Diary of a randomised controlled trial 25 July 2008 from Evidence-based everything

Recruitment finally finished on 10th June 2013. Over 400 ambulance service vehicles included, and more than 4300 patients. Fantastic effort by everyone involved.

PS final total sample size was 4471 - I missed out on the sweepstake to predict the final total by 1, as my guess was 4472!

Testing baseline characteristics, the New England Journal and CONSORT

A practice that is often seen in reports of randomised trials is carrying out significance tests on baseline characteristics, in the belief that this will provide useful information. The main reason for significance tests is to test whether the null hypothesis is true, and it is this that motivates testing of baseline characteristics. Investigators want to see whether there is a “significant” difference between the groups at baseline, because they have been brought up to believe that a “statistically significant” difference is a real difference. [I’ll leave aside the logical fallacy in deciding on the truth or otherwise of the null hypothesis based on a p-value – see other posts]. Obviously, with baseline characteristics in a randomised trial, this is pointless, because you already know that the null hypothesis is true i.e. on average there are no differences between the randomised groups, and any differences that are seen are due to chance.

Significance testing of baseline characteristics has been extensively criticised; for example the CONSORT guidelines say:

“Unfortunately significance tests of baseline differences are still common…. Such significance tests assess the probability that observed baseline differences could have occurred by chance; however, we already know that any differences are caused by chance. Tests of baseline differences are not necessarily wrong, just illogical. Such hypothesis testing is superfluous and can mislead investigators and their readers.”

But significance testing of baseline characteristics has proved very hard to eradicate. Here is an extract from the instructions for authors from the New England Journal of Medicine (I’ve checked and it is still there in June 2013: http://www.nejm.org/page/author-center/manuscript-submission):

“For tables comparing treatment or exposure groups in a randomized trial (usually the first table in the trial report), significant differences between or among groups should be indicated by * for P < 0.05, ** for P < 0.01, and *** for P < 0.001 with an explanation in the footnote if required.” [my bold and underlining]

That is a pretty surprising thing to find in a top journal’s instructions, especially as the next point in the list says that “authors may provide a flow diagram in CONSORT format and all of the information required by the CONSORT checklist”.

The wording of the CONSORT guidance is less than ideal and I hope it will be changed in future revisions. It says “Significance tests assess the probability that observed baseline differences could have occurred by chance…”. This seems a bit misleading, as this isn’t what a p-value means in most cases, though it is more correct for comparisons of baseline characteristics in a randomised trial. The p-value is the probability of getting the data observed (or a more extreme result) if the null hypothesis is true, i.e. it is calculated on the assumption that there is no difference. Obviously it can’t also measure the probability that this assumption is correct.

November 14, 2012

What's wrong with null hypothesis significance testing?

This is my own personal list of the failings of null hypothesis significance testing. The subject has been covered in detail by numerous authors in statistical, medical and psychological literature among other places (references – yes, I’ll add them when I get a minute), yet still hypothesis testing and the p<0.05 culture persists. I will add more points and explanations to the list as time goes on. So, for now, and in no particular order:

1. Significance level is arbitrary. There is no particular reason for choosing p=0.05 as the threshold for “significance” except that Ronald Fisher mentioned it in his early writings on p-values. I wouldn’t want to contradict Sir Ronald on a statistical matter, as he was far cleverer than I am, but he wasn’t advocating p=0.05 as a universal threshold. But this is more or less what it has become. In fact p=0.05 doesn’t represent strong evidence against the null hypothesis.

2. Artificial dichotomy. The division of results into “significant” and “non-significant” encourages erroneous dichotomous thinking; the belief that a “significant” result is real and important, whereas non-significant means there is no effect. None of this is correct. However, this dichotomy is precisely what is required by the Neyman-Pearson hypothesis testing procedure.

3. P-values usually misinterpreted. There is empirical evidence that many or most researchers do not understand what p-values actually mean. There are several published surveys that show that many people are unable to identify true and false statements about p-values. This is more than a bit worrying, when they are widely used to draw conclusions from research.

4. P-values do not tell us what we want to know. The p value is the probability of getting the data (or more extreme data) if there is really zero difference (i.e. prob(data|no difference)). This is not something that we are usually very interested in knowing. Much more relevant is the probability that there is really no difference, given the data that have been observed (prob(no difference|data)), i.e. given the results obtained, how likely is it that there is really no difference. Even more relevant, we want to know things like: given the results observed, how likely is it that there is a clinically important difference? Or: how big is the difference and what is the uncertainty around this?

5. “Statistical significance” is not the same as clinical or scientific significance. It is quite possible (and common) to get a clinically important result that is not statistically significant, and equally (though less common) to have a clinically unimportant result that is statistically significant. This is because p-values depend on the sample size as well as the size of the difference. With a big enough sample size, any difference can be made statistically significant.

6. Calculating a p-value under the assumption that the null hypothesis is true makes little sense because the null hypothesis is almost always false. Two different treatments, or a treatment and placebo or no treatment, will extremely rarely have exactly the same effect on an outcome. The exceptions may be treatments that are known to do absolutely nothing, like homeopathy or reflexology, but even here exactly the same effect would only be expected if the trials could be properly blinded to eliminate the effects of attention from the therapist.

7. P-values are uninformative. Even if you still think that a significance test is a reasonable thing to do, it tells you very little, and nothing that is useful. All it tests is whether a difference is non-zero; it gives no information about the size of the difference or the uncertainty around it, nor even the strength of evidence against the null hypothesis.

8. Instability of p-values on replication. It is a little-appreciated fact that if an experiment is repeated, a quite different p-value can result, and any obtained p-value is only a poor predictor of future p-values. This is less so for extremely small p-values, but more so as they approach the significance threshold.

9. There are several misconceptions and misinterpretations that are frequently made. One of the commonest is that a p-value of less than 0.05 means that the difference is unlikely to be due to chance. But the p-value is calculated on the ASSUMPTION that there is no difference so it obviously cannot say anything about whether or not this assumption is true. For that you need to know how likely the null hypothesis is.

10. Another is that the p-value is the probability that the null hypothesis is true (and hence that 1-p is the probability that the alternative hypothesis is true). Neither of these probabilities has anything to do with the p-value.

11. Yet another common logical fallacy is that if the null hypothesis is not rejected, it is accepted as true. This is the same error as assuming that treatments are the same if they are not found to be “significantly” different – but of course “absence of evidence is not evidence of absence”. Clearly a comparison can give a nonsignificant result for reasons other than the null hypothesis being true.

12. The p-value depends not only on the data, but also on the intention of the experiment. Hence the same set of data can give rise to widely differing p-values, depending on what the intention was when the data were collected (how many subjects were to be included, how many comparisons were to be made, etc). This makes very little intuitive sense. Some good examples are given by Goodman (1999) and Kruschke (2010). The familiar adjustment of p-values for multiple comparisons is one manifestation of this phenomenon. 
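Point 5 in the list above (p-values depend on sample size as well as on the size of the difference) is easy to check numerically. This is a rough sketch only: the function name is invented, the test is a plain two-sided z-test for two proportions with equal group sizes, and the event rates are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(p1, p2, n_per_group):
    """Two-sided p-value from a pooled z-test for two observed proportions."""
    pooled = (p1 + p2) / 2  # equal group sizes, so the pooled rate is the mean
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = abs(p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(z))


if __name__ == "__main__":
    # A clinically trivial difference (40% vs 39%) at increasing sample sizes
    for n in (1_000, 10_000, 100_000):
        print(n, two_proportion_p_value(0.40, 0.39, n))
    # A large difference (40% vs 30%) in a small trial of 50 per group
    print(50, two_proportion_p_value(0.40, 0.30, 50))
```

Under these assumptions the one percentage point difference is nowhere near "significant" with 1,000 per group but overwhelmingly so with 100,000 per group, while the ten point difference is "non-significant" in the small trial.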

Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine 1999; 130:995-1004

Kruschke JK. Bayesian data analysis. WIREs Cognitive Science 2010; 1(5). DOI: 10.1002/wcs.72
