All entries for Monday 17 June 2013
June 17, 2013
Performing a sample size calculation has become part of the rigmarole of randomized trials and is now expected as a sign of “quality”. For example, the CONSORT guidelines include reporting of a sample size calculation as one of the items that should be included in a trial report, and many quality scales and checklists include presence of a sample size calculation as one of the quality markers. Whether any of this is right or just folklore is an interesting issue that receives little attention. [I’m intending to come back to this issue in future posts]
For now I want to focus on one aspect of sample size calculations that seems to me not to make much sense.
In the usual idealized sample size calculation, you start by assuming a treatment effect that you want to be able to detect. Ideally this should be the “minimum clinically important difference” (MCID): the smallest difference that it would be worthwhile to know about, or the smallest difference that would lead to one treatment being favoured over the other in clinical practice. Obviously this is not an easy thing to establish, but leaving practical issues to one side for the moment, in an ideal situation you would have a good idea of the MCID. Having established the MCID, you use it as the treatment effect in a standard sample size calculation, based on a significance test (almost always at the 5% significance level) and a specified level of power (almost invariably 80% or 90%). This gives the number of patients that need to be recruited. A trial of this size will give a “statistically significant” difference the specified percentage of the time (the power) if the true difference is the MCID.
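As a rough sketch of the kind of calculation described above, here is the simple normal-approximation formula for comparing two proportions. The function name and the choice of the unpooled-variance formula are assumptions made for illustration; other standard formulas (pooled variance, continuity correction) give slightly different numbers. The proportions used are the 40% versus 30% from the example later in this post.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate participants per group for comparing two proportions.

    Simple normal-approximation formula with unpooled variances
    (an illustrative choice, not the only standard formula).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    difference = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)


# 40% control risk versus 30% intervention risk, at 80% and 90% power
n_80 = sample_size_per_group(0.40, 0.30, power=0.80)
n_90 = sample_size_per_group(0.40, 0.30, power=0.90)
print(n_80, n_90)
```

At 80% power this comes out at a little over 350 per group, and 90% power needs noticeably more.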
The problem here is that the sample size calculation is based on finding a statistically significant result, not on demonstrating that the difference is larger than a certain size. But if you have identified a minimum clinically important difference, what you want to be able to say with a high degree of confidence is whether the treatment effect exceeds it. Statistical significance is equivalent only to finding that the difference is non-zero. In a trial sized this way, the confidence limit nearer to zero is likely to be close to zero, and the interval will only rarely be far enough from zero to exclude the MCID. Hence the standard sample size may have adequate power to show whether there is a non-zero difference, but has very little power to show that the difference exceeds the MCID. Most results will therefore be inconclusive: they will show evidence of benefit, but leave it uncertain whether the benefit is large enough to be clinically important.
As an example, imagine the MCID is thought to be a risk ratio of 0.75 (a bad outcome occurs in 40% of the control group and 30% of the intervention group). A standard sample size calculation gives 350 participants per group. So you do the trial and (unusually!) the proportions are exactly as expected: 40% in the control group and 30% in the intervention group. The calculated risk ratio is 0.75 but the 95% confidence interval around this is 0.61 to 0.92. So you can conclude that the treatment has a non-zero effect, but you don’t know whether it exceeds the minimum clinically important difference. Because the point estimate sits exactly on the MCID, with this result there is only about a 50% chance that the real treatment effect exceeds it.
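The confidence interval in this example can be checked with a few lines of Python. This is a minimal sketch assuming the usual Wald interval on the log risk ratio scale; the function name is made up for illustration, and the event counts (105 and 140 out of 350) are simply 30% and 40% of 350.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def risk_ratio_ci(events_treatment, n_treatment, events_control, n_control, level=0.95):
    """Risk ratio with a Wald confidence interval calculated on the log scale."""
    rr = (events_treatment / n_treatment) / (events_control / n_control)
    # Standard error of the log risk ratio
    se_log_rr = sqrt(
        1 / events_treatment - 1 / n_treatment
        + 1 / events_control - 1 / n_control
    )
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return rr, exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr)


# 30% of 350 in the intervention group, 40% of 350 in the control group
rr, lower, upper = risk_ratio_ci(105, 350, 140, 350)
print(f"RR {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
# prints: RR 0.75, 95% CI 0.61 to 0.92
```

The interval comfortably excludes a risk ratio of 1 but extends well to either side of 0.75.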
So sizing a trial based on the MCID might seem like a good idea, but in fact if you use the conventional methods, the result is probably not going to give you much information about whether the treatment effect really is bigger than the MCID or not. I suspect that in most cases the excitement of a “statistically significant” result overrides any considerations of the strength of the evidence that the effect size is clinically useful.
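To put a rough number on this, the sketch below estimates how often a trial with 350 per group would produce a 95% confidence interval lying entirely on the beneficial side of the MCID, for several possible true risk ratios. It assumes the estimated log risk ratio is normally distributed around the true value, and the function name and the list of true risk ratios are chosen purely for illustration.

```python
from math import log, sqrt
from statistics import NormalDist


def prob_ci_excludes_mcid(true_rr, mcid_rr=0.75, p_control=0.40,
                          n_per_group=350, level=0.95):
    """Approximate chance that the whole confidence interval lies beyond the MCID.

    Assumes the estimated log risk ratio is normally distributed around the
    true value, with the standard error evaluated at the true proportions.
    """
    p_treatment = true_rr * p_control
    se = sqrt(
        (1 - p_treatment) / (n_per_group * p_treatment)
        + (1 - p_control) / (n_per_group * p_control)
    )
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # The upper limit is below the MCID when estimate + z * se < log(mcid_rr)
    return NormalDist().cdf((log(mcid_rr) - log(true_rr)) / se - z)


for true_rr in (0.75, 0.70, 0.60, 0.50):
    print(f"true RR {true_rr:.2f}: {prob_ci_excludes_mcid(true_rr):.3f}")
```

Under these assumptions, if the true effect is exactly the MCID the interval excludes it only about 2.5% of the time, and even a true risk ratio of 0.60 gives only roughly an even chance.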
Follow-up to Diary of a randomised controlled trial 25 July 2008 from Evidence-based everything
Recruitment finally finished on 10th June 2013. Over 400 ambulance service vehicles included, and more than 4300 patients. Fantastic effort by everyone involved.
PS final total sample size was 4471 - I missed out on the sweepstake to predict the final total by 1, as my guess was 4472!
A practice that is often seen in reports of randomised trials is carrying out significance tests on baseline characteristics, in the belief that this will provide useful information. The main reason for significance tests is to test whether the null hypothesis is true, and it is this that motivates testing of baseline characteristics. Investigators want to see whether there is a “significant” difference between the groups at baseline, because they have been brought up to believe that a “statistically significant” difference is a real difference. [I’ll leave aside the logical fallacy in deciding on the truth or otherwise of the null hypothesis based on a p-value – see other posts]. Obviously, with baseline characteristics in a randomised trial, this is pointless, because you already know that the null hypothesis is true i.e. on average there are no differences between the randomised groups, and any differences that are seen are due to chance.
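A small simulation makes the point concrete. This is only a sketch with made-up data: the "ages" are drawn from a hypothetical normal distribution, patients are allocated at random to two groups, and a large-sample z-test is applied to the baseline variable in each simulated trial.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def baseline_p_value(rng, n_per_group=100):
    """Randomise simulated patients to two groups and test a baseline variable."""
    ages = [rng.gauss(60, 10) for _ in range(2 * n_per_group)]  # hypothetical ages
    rng.shuffle(ages)  # random allocation to the two groups
    group_a, group_b = ages[:n_per_group], ages[n_per_group:]
    se = sqrt(variance(group_a) / n_per_group + variance(group_b) / n_per_group)
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided large-sample z-test


rng = random.Random(2013)  # fixed seed so the run is reproducible
n_trials = 2000
p_values = [baseline_p_value(rng) for _ in range(n_trials)]
proportion_significant = sum(p < 0.05 for p in p_values) / n_trials
print(f"Proportion of trials with P < 0.05 at baseline: {proportion_significant:.3f}")
```

The proportion should come out close to 5%: the test flags "significant" baseline differences at exactly the rate expected by chance, which is all it can do when the null hypothesis is known to be true.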
Significance testing of baseline characteristics has been extensively criticised; for example the CONSORT guidelines say:
“Unfortunately significance tests of baseline differences are still common…. Such significance tests assess the probability that observed baseline differences could have occurred by chance; however, we already know that any differences are caused by chance. Tests of baseline differences are not necessarily wrong, just illogical. Such hypothesis testing is superfluous and can mislead investigators and their readers.”
But significance testing of baseline characteristics has proved very hard to eradicate. Here is an extract from the instructions for authors from the New England Journal of Medicine (I’ve checked and it is still there in June 2013: http://www.nejm.org/page/author-center/manuscript-submission):
“For tables comparing treatment or exposure groups in a randomized trial (usually the first table in the trial report), significant differences between or among groups should be indicated by * for P < 0.05, ** for P < 0.01, and *** for P < 0.001 with an explanation in the footnote if required.” [my bold and underlining]
That is a pretty surprising thing to find in a top journal’s instructions, especially as the next point in the list says that “authors may provide a flow diagram in CONSORT format and all of the information required by the CONSORT checklist”.
The wording of the CONSORT guidance is less than ideal and I hope it will be changed in future revisions. It says “Significance tests assess the probability that observed baseline differences could have occurred by chance…”. This seems a bit misleading, as this isn’t what a p-value means in most cases, though it is more correct for comparisons of baseline characteristics in a randomised trial. The p-value is the probability of getting the data observed (or a more extreme result), calculated on the assumption that the null hypothesis is true, i.e. that there is no difference. Obviously it can’t also measure the probability that this assumption is correct.