June 17, 2013

Testing baseline characteristics, the New England Journal and CONSORT

A practice often seen in reports of randomised trials is carrying out significance tests on baseline characteristics, in the belief that this provides useful information. Significance tests are usually performed to assess whether the null hypothesis is true, and this is what motivates testing of baseline characteristics: investigators want to see whether there is a “significant” difference between the groups at baseline, because they have been brought up to believe that a “statistically significant” difference is a real difference. [I’ll leave aside the logical fallacy of deciding on the truth or otherwise of the null hypothesis based on a p-value – see other posts.] With baseline characteristics in a randomised trial this is obviously pointless, because you already know that the null hypothesis is true: on average there are no differences between the randomised groups, and any differences that are seen are due to chance.
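The point can be illustrated with a quick simulation (my sketch; the group sizes, the baseline variable, and the normal-approximation test are all illustrative choices, not anything from a real trial). Because allocation is random, a baseline test rejects at roughly its nominal 5% rate, purely by chance:

```python
import math
import random
import statistics

def two_sided_p(a, b):
    # Two-sample test using a normal approximation to the t distribution
    # (adequate here with ~100 patients per group).
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    # Two-sided tail probability: P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

random.seed(1)
n_trials, n_per_group, alpha = 2000, 100, 0.05
false_positives = 0
for _ in range(n_trials):
    # One baseline characteristic (say, age), drawn from a single population:
    patients = [random.gauss(50, 10) for _ in range(2 * n_per_group)]
    random.shuffle(patients)  # randomised allocation to two groups
    group_a, group_b = patients[:n_per_group], patients[n_per_group:]
    if two_sided_p(group_a, group_b) < alpha:
        false_positives += 1

# The proportion of "significant" baseline differences is close to alpha:
print(false_positives / n_trials)
```

So in a long run of perfectly conducted randomised trials, about one baseline comparison in twenty will be flagged with an asterisk, telling us nothing we did not already know.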

Significance testing of baseline characteristics has been extensively criticised; for example the CONSORT guidelines say:

“Unfortunately significance tests of baseline differences are still common…. Such significance tests assess the probability that observed baseline differences could have occurred by chance; however, we already know that any differences are caused by chance. Tests of baseline differences are not necessarily wrong, just illogical. Such hypothesis testing is superfluous and can mislead investigators and their readers.”

But significance testing of baseline characteristics has proved very hard to eradicate. Here is an extract from the instructions for authors from the New England Journal of Medicine (I’ve checked and it is still there in June 2013: http://www.nejm.org/page/author-center/manuscript-submission):

“For tables comparing treatment or exposure groups in a randomized trial (usually the first table in the trial report), significant differences between or among groups should be indicated by * for P < 0.05, ** for P < 0.01, and *** for P < 0.001 with an explanation in the footnote if required.” [my bold and underlining]

That is a pretty surprising thing to find in a top journal’s instructions, especially as the next point in the list says that “authors may provide a flow diagram in CONSORT format and all of the information required by the CONSORT checklist”.

The wording of the CONSORT guidance is less than ideal, and I hope it will be changed in future revisions. It says “Significance tests assess the probability that observed baseline differences could have occurred by chance…”. This is a bit misleading, as it is not what a p-value means in most cases, though it comes closer to being correct for comparisons of baseline characteristics in a randomised trial. The p-value is the probability of getting the observed data (or a more extreme result), calculated on the assumption that the null hypothesis is true, i.e. that there is no difference. Obviously it cannot also measure the probability that this assumption is correct.
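To sketch what the p-value does behave like under a true null (again my own illustration, with arbitrary sample sizes and a normal-approximation test): when there really is no difference, p-values are roughly uniform on [0, 1], so small p-values at baseline are just the expected lower tail, not evidence of anything.

```python
import math
import random
import statistics

def two_sided_p(a, b):
    # Two-sample test via a normal approximation (a sketch, not a library call).
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(z / math.sqrt(2))

random.seed(2)
# Both groups drawn from the same distribution, so the null is true:
pvals = [two_sided_p([random.gauss(0, 1) for _ in range(50)],
                     [random.gauss(0, 1) for _ in range(50)])
         for _ in range(1000)]

# Under a true null the p-values are approximately uniform:
print(statistics.mean(pvals))        # near 0.5
print(sum(p < 0.05 for p in pvals))  # near 50 out of 1000
```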

3 comments

  1. matt

    Not sure I agree with you about this. If you’ve got a significant difference on some baseline measure when the groups were assigned randomly, it tells you quite a lot about how you should be interpreting any significant (or non-significant, for that matter) differences in your dependent variable. It might, for example, suggest that your two groups differ (by chance) on some important factor which imperfectly correlates with your covariate and your dependent variable. It’s possible that this would be enough to destroy a “real” between-groups difference on your dependent variable, even if you control for your covariate. In general my view is that significant differences on covariates mean that you should interpret all results from the study with added caution. That’s quite useful information, I think.

    19 Jun 2013, 19:36

  2. Simon Gates

    Thanks for commenting, Matt – I do wonder if anyone ever looks at any of this; not that this is a problem, as I just post stuff here as a way of collecting interesting issues. If it stimulates debate, so much the better.

    Anyway, I think the point here is that “significance” or otherwise of baseline differences is inappropriate. Usually in a statistical test there are two possible explanations for a low p-value: the data are unusual, or the null hypothesis is wrong. When groups have been assigned randomly, you know that the null hypothesis is true (there is no difference, on average at least), so a low p-value tells you only that the data are unusual. This is expected to happen sometimes, so it doesn’t really tell you much. I agree that differences in baseline characteristics that occur by chance could still obscure a real difference in the outcome that you’re interested in, so they need to be taken into account. But you will be able to detect these by looking at the size of the difference between the groups – I can’t see that a significance test adds anything here.

    20 Jun 2013, 11:44

  3. matt

    I think your comment reveals how nonsensical null hypothesis testing is (and I see from your other posts that you agree about this). Clearly the null hypothesis isn’t ever true (even in this random case): if you randomised the whole population into two groups and measured them on some characteristic, the means would not be identical to infinitely many decimal places (or at least we know from measure theory that the probability of that happening would be zero). So I guess my response to your comment is that I agree “significance” is inappropriate here, but no more inappropriate than it is anywhere. If you think about hypothesis testing as a useful decision-making heuristic for determining whether or not observed differences should be of interest to you, then it is as useful in this case as it is anywhere else. Your suggested alternative (looking at the effect size) is, I guess, the more sophisticated option, but it lacks the clarity of the hypothesis-test decision-making heuristic (there are no clear guidelines about what effect sizes are too big, etc.).

    20 Jun 2013, 16:23
