April 09, 2015

Journal statistical instructions – is that it??

Writing about web page http://www.resuscitationjournal.com/content/authorinfo

I submitted a manuscript to the journal Resuscitation recently. It's a pretty well-regarded medical journal, with an impact factor (for 2013) of 3.96, so a publication there would be a good solid paper. While formatting the manuscript I had a look at the statistical section of the Instructions for Authors. This is what I found:

"Statistical Methods
* Use nonparametric methods to compare groups when the distribution of the dependent variable is not normal.
* Use measures of uncertainty (e.g. confidence intervals) consistently.
* Report two-sided P values except when one-sided tests are required by study design (e.g., non-inferiority trials). Report P values larger than 0.01 to two decimal places, those between 0.01 and 0.001 to three decimal places; report P values smaller than 0.001 as P<0.001."

That's it! 69 words (including the title), more than half of which (43) are about reporting of p-values. I really don't think that many people would find this very useful (for example, what does "use measures of uncertainty consistently" mean?). Moreover, it seems to start from the premise that statistical analysis IS null hypothesis significance testing, and there are lots of reasons to take issue with that point of view. And finally (for now) it is questionable whether two-sided tests are usually the right thing to do, as we are usually interested in whether a treatment is better than another, not in whether it is different (better or worse) - I won't get further into that now, but suffice to say it is a live issue.
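To show how little the P value rule actually specifies, here is a minimal sketch of it in Python. The function name `format_p` is made up for illustration, and the handling of values exactly on the boundaries (0.01 and 0.001) is an assumption, since the journal's wording doesn't say which side they fall on.

```python
def format_p(p: float) -> str:
    """Format a P value following the reporting rule quoted above.

    Assumption: values exactly equal to 0.01 are shown to three decimal
    places and values exactly equal to 0.001 are shown as "P=0.001".
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("A P value must lie between 0 and 1")
    if p < 0.001:
        return "P<0.001"
    if p <= 0.01:
        # Between 0.001 and 0.01: three decimal places.
        return f"P={p:.3f}"
    # Larger than 0.01: two decimal places.
    return f"P={p:.2f}"


for value in (0.0984, 0.004, 0.0002):
    print(format_p(value))
```

Nothing in the rule says what the P value should be testing, or whether it should be reported at all, which is rather the point.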

January 25, 2015

"Classical" statistics

There is a tendency to describe traditional frequentist methods as "classical" statistics, often making a contrast with modern Bayesian methods, which are (or at least appear in their modern guise) much newer and a break with tradition. That's kind of fair enough but I don't like the term classical being applied to traditional statistics, for two main reasons.

1. "Classical" is already in use for describing other types of thing (music, literature, architecture) and it has connotations of quality that aren't really applicable to statistics. These are classical:



2. It's inaccurate. Frequentist statistics dates from the mid 20th century. Bayesian statistics goes back much further, to Laplace (early 19th century) and Bayes (18th century) - so if anything should be called "classical", it is Bayesian methods.

December 23, 2014

The Cochrane diamond

You know, the one at the bottom of your meta-analysis that summarises the pooled result? This one:


Well, I don't like it. Why not? I think it's misleading, because the diamond shape (to me at least) suggests it is representing a probability distribution. It puts you in mind of something like this:

And that seems to make sense - the thick bit of the diamond, where your point estimate is, ought to be the area where the (unknown) true treatment effect would be most likely to be, and the thin points of the diamond are like the tails of the distribution, where the probability of the true value is getting smaller and smaller. That would be absolutely right, if the analysis was giving you a Bayesian credible interval - but it isn't.

It's a frequentist confidence interval, and as lots of people have been showing recently, frequentist confidence intervals do not represent probability distributions. They are just an interval constructed by an algorithm so that, if the experiment were repeated many times, 95% of the intervals would include the true value. They are NOT a distribution of the probability of any value of the treatment effect, conditional on the data, although that is the way they are almost always interpreted. They don't say anything about the probability of the location of the true value, or even whether it is inside or outside any particular interval.
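The "repeated many times" property can be checked with a small simulation. This is only a sketch under simple assumptions that are not from any real meta-analysis: normally distributed data with a known standard deviation, and an arbitrary true effect of 0.3. The function name and all parameter values are made up for illustration.

```python
import random
from statistics import NormalDist


def ci_coverage(true_mean=0.3, sigma=1.0, n=50, n_trials=10_000, seed=1):
    """Fraction of simulated 95% confidence intervals containing true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)        # about 1.96
    half_width = z * sigma / n ** 0.5      # known-sigma interval half-width
    covered = 0
    for _ in range(n_trials):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        # The interval is sample_mean +/- half_width; check it covers the truth.
        if abs(sample_mean - true_mean) <= half_width:
            covered += 1
    return covered / n_trials


print(f"Proportion of intervals containing the true value: {ci_coverage():.3f}")
```

The proportion comes out close to 0.95, but that is a statement about the procedure over many repetitions. Any single interval either contains the true value or it doesn't, and the simulation says nothing about where within an interval the true value is likely to be.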

I think a solid bar would be a more reasonable way to represent the 95% confidence interval.

For more info:

Hoekstra R, Morey RD, Rouder JN, Wagenmakers EJ. Robust misinterpretation of confidence intervals. Psychon Bull Rev. 2014. DOI 10.3758/s13423-013-0572-3

August 28, 2014

Treatment success in HTA trials: thoughts on Dent & Raftery 2011

Dent & Raftery analysed how many trials funded by HTA showed treatment benefit, harm, or were inconclusive (Trials 2011; 12:109). They found that 24% of trials showed a "significant" result (19% in favour of the new intervention and 5% in favour of control). How many of these results are likely to be correct?

Power and the proportion of interventions that are really effective determine the number of significant results that we see. If all interventions are effective, and power of all the trials is 90%, then 90% of trials will give a "significant" result. If no interventions are effective then we expect 5% of results to be significant; these are all false positives. See the graph below.

Proportion of truly effective trials against proportion found significant, by power

Proportion of truly effective trials in a field (i.e. the “population” of interventions that could be tested) against the proportion of “significant” effects that will be found, for different values of power. The proportion of truly effective interventions always exceeds the proportion found to be significant.

So, reading across from 24% significant results, in the population of interventions there could be from about 25% (if power is 90%) to 100% (if power is less than about 25%) effective interventions. Power probably isn't 90%; most trials are designed to have 80-90% power, usually for an optimistic effect size, so given that many effect sizes are smaller than anticipated and many trials recruit fewer than expected, power might typically be 60-70%. This suggests that there would be around 30% of interventions that are truly effective in the population of interventions evaluated by HTA trials.
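The "reading across" can also be done with a formula rather than by eye. This is a rough sketch assuming the simple model described above: every trial has the same power, the Type I error rate is 5%, and every significant result for an ineffective intervention is a false positive. The function names are hypothetical, and because the figures in the text were read off a graph, the computed values may differ slightly from them.

```python
def prop_significant(prop_effective, power, alpha=0.05):
    """Expected proportion of trials giving a 'significant' result."""
    return prop_effective * power + (1 - prop_effective) * alpha


def implied_prop_effective(prop_sig, power, alpha=0.05):
    """Invert prop_significant: proportion effective implied by prop_sig."""
    if power <= alpha:
        raise ValueError("power must exceed alpha for the inversion to work")
    p = (prop_sig - alpha) / (power - alpha)
    # Clamp to [0, 1]: a value above 1 means this power is too low to
    # produce the observed proportion even if every intervention works.
    return min(max(p, 0.0), 1.0)


for power in (0.9, 0.7, 0.6, 0.25):
    implied = implied_prop_effective(0.24, power)
    print(f"power={power:.0%}: implied proportion effective = {implied:.1%}")
```

The inversion makes it obvious how sensitive the answer is to the assumed power, which nobody actually knows.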

How often do significance tests get it right? If power is 70%, 86.86% of significant differences are truly effective interventions, meaning that 13.14% are ineffective. As power decreases, the proportion of significant differences that are really effective decreases - the positive predictive value of a significant result gets worse.


Proportion of significant effects that are truly effective (PPV) for different values of power.

If a low proportion are really effective, a lot of significant effects will be false positives. Low power also makes this worse.
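For reference, the positive predictive value can be written down directly. This is only a sketch under the same simple assumptions as above; the symbols are introduced here for illustration: $\pi$ is the proportion of interventions that are truly effective, $1-\beta$ is the power, and $\alpha$ is the Type I error rate (5% here).

```latex
\[
\mathrm{PPV}
  = \frac{\pi\,(1-\beta)}{\pi\,(1-\beta) + (1-\pi)\,\alpha}
\]
```

Lowering either $\pi$ or $1-\beta$ shrinks the true positive term $\pi(1-\beta)$ relative to the false positive term $(1-\pi)\alpha$, which is why the PPV falls in both cases. The exact percentage you get depends on the value assumed for $\pi$, which is only known approximately.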

July 17, 2014

The EAGeR trial: Preconception low–dose aspirin and pregnancy outcomes

Lancet Volume 384, Issue 9937, 5–11 July 2014, Pages 29–36

Some extracts from the abstract:
"Overall, 1228 women were recruited and randomly assigned between June 15, 2007, and July 15, 2011, 1078 of whom completed the trial and were included in the analysis."
"309 (58%) women in the low-dose aspirin group had livebirths, compared with 286 (53%) in the placebo group (p=0·0984; absolute difference in livebirth rate 5·09% [95% CI −0·84 to 11·02])."
"Preconception-initiated low-dose aspirin was not significantly associated with livebirth or pregnancy loss in women with one to two previous losses. .... Low-dose aspirin is not recommended for the prevention of pregnancy loss."

So - the interpretation is a so-called "negative" trial, i.e. one that did not show any evidence of effectiveness.
BUT... the original planned sample size was 1600, with 1254 included in analyses (the other 346 being the 20% allowance for loss to follow up), which was calculated to have 80% probability of a "significant" result if there was in reality a 10 percentage point increase in live births in the intervention group, from 75% in the control group.
In fact the trial recruited 1228 and lost 12.2% so only 1078 were included in the analyses (86% of the target). The placebo group incidence was different from expectation (53% compared with 75%) and the treatment effect was about half of that the sample size was calculated on (absolute difference of 5% rather than 10%), though they were more similar expressed as risk ratios than risk differences (1.09 compared with 1.13). Nevertheless the treatment effect was quite a bit smaller than the effect the trial was set up to find.
So is concluding ineffectiveness here reasonable? A 5% improvement in live birth rate could well be important to parents, and it is not at all clear that the 10% difference originally specified represents a "minimum clinically important difference". So the trial could easily have missed a potentially important benefit. This isn't addressed anywhere in the paper. The conclusions seem to be based mainly on the "non-significant" result (p=0.098), without any consideration of what the trial could realistically have detected.
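To get a feel for what the trial could realistically have detected, here is a back-of-envelope power calculation. It is a sketch, not a reconstruction of the trial's own sample size calculation: it uses a normal approximation for comparing two proportions, assumes equal groups of 539 (half of 1078), and ignores any stratification in the design. The function name is made up for illustration.

```python
from statistics import NormalDist


def power_two_proportions(p_control, p_treat, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Standard error of the difference, using each group's own variance.
    se = (p_control * (1 - p_control) / n_per_group
          + p_treat * (1 - p_treat) / n_per_group) ** 0.5
    z = abs(p_treat - p_control) / se
    # Probability of landing in either rejection region.
    return norm.cdf(z - z_crit) + norm.cdf(-z - z_crit)


# Observed placebo rate of 53% and a true 5 percentage point improvement.
power = power_two_proportions(0.53, 0.58, 539)
print(f"Approximate power to detect 53% vs 58%: {power:.2f}")
```

On these assumptions the power to detect a 5 percentage point difference is a little under 40%, so a "non-significant" result is what you would expect more often than not even if the benefit were real.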

July 02, 2014

And I agree with David Colquhoun!

David C on the madness of the REF


April 07, 2014

David Colquhoun agrees with me!

On the hazards of significance testing. Part 2: the false discovery rate, or how not to make a fool of yourself with P values


makes much the same points as I have made elsewhere in this blog, though he doesn't go as far as recommending Bayesian analyses. But I can't see how you can sensibly interpret p-values without a prior, and if you're going to go that far, a fully Bayesian analysis is the natural thing to do, surely?

February 20, 2014

Australian homeopathy review – surely they are kidding?

The NHMRC in Australia's strategic plan identified "‘examining alternative therapy claims’ as a major health issue for consideration by the organisation, including the provision of research funding" (http://www.nhmrc.gov.au/your-health/complementary-and-alternative-medicines). Well, that seems OK; they include a wide range of treatments under "complementary and alternative therapies", from the relatively mainstream (meditation and relaxation therapies, osteopathy) to the completely bonkers (homeopathy, reflexology), so it is reasonable to investigate the effectiveness of some of these.

But hold on! Further down the page we find a "Homeopathy review", and NHMRC have convened a "Homeopathy Working Committee" to oversee this. The plan seems to be to conduct an overview of systematic reviews on the effectiveness of homeopathy, to produce an information paper and position statement. The Working Committee includes some eminent names, and one member who, as a teacher of homeopathy, has a clear conflict of interest. I suppose you can argue that it is important to have a content expert in a review team, but in this case, where someone cannot help but have a personal interest in one particular outcome, it doesn't seem right. Like asking a committed Christian to weigh up dispassionately the evidence for the existence of god(s); unlikely to work.

I am somewhat staggered that this review is going ahead as it can only come to one credible conclusion, and I am struggling to understand the NHMRC's motivation. Did the homeopathy lobby push for this as part of its effort to be seen as evidence based and mainstream? Or did the NHMRC think that this was the best way to put homeopathy to bed for good? If the latter, I doubt it will be successful, as there will always be odd "statistically significant" results from trials of homeopathy, caused by bias or chance, that will keep the possibility of effectiveness alive in the minds of the credulous.

I have contacted the Homeopathy Working Committee to encourage them to use Bayesian methods with an appropriate prior!

UPDATE 25 July 2014

The report has been published and you can read it here: https://www.nhmrc.gov.au/your-health/complementary-medicines/homeopathy-review.

The conclusion is less than scintillating:

"There is a paucity of good-quality studies of sufficient size that examine the effectiveness of homeopathy as a treatment for any clinical condition in humans. The available evidence is not compelling and fails to demonstrate that homeopathy is an effective treatment for any of the reported clinical conditions in humans."

At least it concluded lack of effectiveness, but the comments on the lack of good quality studies might encourage people to keep doing homeopathy studies - which would in my view be completely misguided.

January 24, 2014


...published a couple of years ago in American Journal of Respiratory and Critical Care Medicine (Wunderink et al Am J Resp Crit Care Med 2011; 183(11): 1561-1568). It's a trial of recombinant tissue factor pathway inhibitor (tifacogin) for patients with severe community acquired pneumonia, and randomised people to tifacogin 0.025 mg/kg/h, 0.075 mg/kg/h, or placebo. The rationale for it was that tifacogin seemed to be beneficial in the subgroup with severe community acquired pneumonia in a previous trial of patients with sepsis (which rings alarm bells with me, but that's another issue). The trial was international, involving 188 centres, and randomised 238 patients, so a major undertaking.

The interesting point about it was that they performed an interim analysis, as a result of which they stopped randomisation to the higher dose of drug due to lack of efficacy (futility) but continued to randomise to the lower dose. This seems extraordinary; if the high dose isn't doing anything, it seems pretty unlikely that the low dose would. I could understand it if the high dose was stopped because of toxicity or increase in adverse outcomes, like death, but that doesn't seem to have been the case.

Unsurprisingly, the final trial results showed no difference in mortality between tifacogin (18%) and placebo (17.9%). Has there ever been a case where a promising-looking subgroup result was shown in a subsequent trial to be correct?

December 03, 2013

"Significance testing" and prior probabilities

I recently came across a helpful account of an issue which has been bothering me, which is the interpretation of significance tests. It was in a slightly unexpected place – the GraphPad software online statistics guide:


The issue is about how you interpret a “significant” p-value. Say you compare a drug to placebo to see if it cures people, and you get a “significant” effect (p < 0.05). Does that mean the drug works? Not necessarily. Apart from the obvious 5% of occasions when you will get a “significant” effect when the drug does nothing, it also depends on the prior probability that a drug is effective. It’s exactly the same issue as with diagnostic tests, where the prevalence of a disease has a huge effect on the positive predictive value of a test. If a disease is very rare, even a test with extremely high sensitivity and specificity can be essentially useless, because almost all of the positives will be false positives.

So it is with trials. If your trial has 80% power, and a 5% Type I error rate, then if the prior probability of a drug being effective is 80% then in 1000 replicates of the experiment you will get:

Prior probability=80%

                              Drug really works   Drug really doesn't work   Total
P<0.05, “significant”                       640                         10     650
P>0.05, “not significant”                   160                        190     350
Total                                       800                        200    1000

So on 640 of the 650 occasions (98.5%) where you get a “significant” result, the drug will really be effective. [It would also be effective in nearly half of the experiments with a “non-significant” result (160/350).]

However, if there is only a 10% chance that the drug really works, things look a lot worse.

Prior probability=10%

                              Drug really works   Drug really doesn't work   Total
P<0.05, “significant”                        80                         45     125
P>0.05, “not significant”                    20                        855     875
Total                                       100                        900    1000

Now the drug is only really effective in 64% of trials with a “significant” result. With 1% prior probability of the drug’s effectiveness, it really works in only 14% of trials with “significant” results.
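The arithmetic behind these figures is easy to reproduce. Here is a minimal sketch, using the same assumptions as the examples above (80% power, 5% Type I error rate, 1000 replicates); the function name and the dictionary keys are made up for illustration.

```python
def expected_table(prior, power=0.8, alpha=0.05, n_trials=1000):
    """Expected 2x2 counts for n_trials replicates, plus the PPV."""
    works = n_trials * prior          # replicates where the drug really works
    no_effect = n_trials - works      # replicates where it really doesn't
    table = {
        "true_positive": works * power,
        "false_negative": works * (1 - power),
        "false_positive": no_effect * alpha,
        "true_negative": no_effect * (1 - alpha),
    }
    significant = table["true_positive"] + table["false_positive"]
    # Proportion of "significant" results where the drug really works.
    table["ppv"] = table["true_positive"] / significant
    return table


for prior in (0.8, 0.1, 0.01):
    t = expected_table(prior)
    print(f"prior={prior:.0%}: true positives={t['true_positive']:.1f}, "
          f"false positives={t['false_positive']:.1f}, PPV={t['ppv']:.1%}")
```

Changing `power` and `prior` to other values shows how quickly the proportion of false positives grows as either one drops.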

So the prior probability of the treatment’s effectiveness is absolutely crucial in interpretation of the results of trials. But I don’t think I have ever seen this mentioned in the results or discussion of a paper. I'm really not sure how you would go about downgrading your confidence in a frequentist result based on the prior probability; there isn't a mechanism for doing this. But this is undoubtedly a major cause of misinterpretation of trial results. When you consider that most trials have pretty low power (maybe 50-60% at best) to detect realistic treatment effects, and that the majority of interventions that are tested probably don't work (maybe at best 20% are effective?), then the false positive rate is going to be substantial.

This is another way in which Bayesian methods score over traditional analyses; they force us to consider the prior probabilities of hypotheses, and to include them explicitly in the analysis. The issue seems always to be swept under the carpet in traditional analyses, with potentially disastrous consequences. Actually, saying it is swept under the carpet is probably inaccurate - most people are completely unaware that this is even an issue.
