<link rel="alternate" href="https://blogs.warwick.ac.uk/simongates" />
<link rel="self" href="https://blogs.warwick.ac.uk/simongates/newentries/?num=10&start=10&atom=atom" />
<contributor>
<name>Simon Gates</name>
</contributor>
<subtitle />
<id>https://blogs.warwick.ac.uk/simongates/newentries/?num=10&start=10&atom=atom</id>
<generator uri="https://blogs.warwick.ac.uk">Warwick Blogs, University of Warwick</generator>
<rights>(C) 2020 Simon Gates</rights>
<updated>2020-08-03T12:51:07Z</updated>
<entry>
<title>Bayesian methods and trials in rare and common diseases by Simon Gates</title>
Simon Gates
https://blogs.warwick.ac.uk/simongates/entry/bayesian_methods_and/
2016-12-14T10:00:31Z
2016-12-14T10:00:31Z
<p>One of the places that Bayesian methods have made some progress in the clinical trials world is in very rare diseases. And it’s true, traditional methods are hopeless in this situation, where you can never get enough recruits to get anywhere near the sample size that traditional methods demand for an “adequately powered” study, and it’s unlikely that a result will be “statistically significant”. Bayesian methods really help here, because they give you a result in terms of probability that a treatment is superior. This is good for two main reasons. First, it’s helpful to quantify the probability of benefit, and its size and uncertainty. This tells us a lot more than simply dichotomising it into “significant” and “non-significant”, with the unstated assumption that “significant” means clinically useful. Second, there isn’t a fixed probability of benefit that means an intervention should be used; it will vary from situation to situation. For example, if there is almost no cost to using a treatment, it might only need a small probability of being better to be worthwhile. If we don’t estimate this probability we can’t make this sort of judgement.</p>
<p>But (and this is something I have experienced several times now in a variety of places so I think it is real) – this seems to have had an unfortunate side effect. A perception seems to have grown that Bayesian methods are something to consider using when a “proper” trial (with all of the usual stuff: interpretation based on p &lt; 0.05 in a null hypothesis test, fixed pre-planned sample size based on a significance test, 80% or 90% power and so on) isn’t feasible. In reality, the ability to quantify probability of benefit would be helpful in just about all situations, even (or especially) large Phase 3 trials that are looking for modest treatment benefits. How many of these trials don’t “achieve statistical significance” but have results that would show a 70% or 80% probability of benefit? They might still provide good enough evidence to make decisions about treatments (based on, for example, cost-effectiveness), but at the moment they tend to get labelled as “non-significant” or “negative trials.”</p>
Evidence-based everything
https://blogs.warwick.ac.uk/simongates/
(C) 2020 Simon Gates
2016-12-14T10:00:31Z
0
“The probability that the results are due to chance” by Simon Gates
Simon Gates
https://blogs.warwick.ac.uk/simongates/entry/8220the_probability_that/
2016-11-12T09:08:32Z
2016-11-12T09:08:32Z
<p>One of the (wrong) explanations that you often see of what a p-value means is “the probability that data have arisen by chance.” I think people may struggle to see why this is wrong, as I did for a long time. A p-value is the probability of getting the data (or more extreme data) if the null hypothesis (no difference) is correct – right? So that would mean the specific result you got must have been due to chance variation, doesn’t it? So why isn’t the p-value the probability that the result was due to chance?</p>
<p>The problem is that there are two ways of interpreting “the probability that a result is due to chance.” <br />
1. The probability that chance or random variation was the process that produced the result;<br />
2. The probability of getting the specific data (or more extreme data) that you got in your experiment, if chance was the only process operating.</p>
<p>The second of these is what the p-value tells you; but the first is the interpretation that most people give it. The p-value tells you nothing about the process that produced the result, because it is calculated on the assumption that the null hypothesis is correct.</p>
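<p>A rough numerical sketch of the gap between the two interpretations. The numbers are assumptions chosen purely for illustration (the proportion of tested hypotheses where the null is true, the power, and the 0.05 threshold do not come from any real study). It uses Bayes’ theorem to get at interpretation 1 for a “significant” result, which needs information that a p-value alone does not contain:</p>

```python
def prob_null_given_significant(prior_null, power, alpha=0.05):
    """P(null is true | result is 'significant'), by Bayes' theorem.

    prior_null: assumed proportion of tested hypotheses where the null is true
    power:      assumed probability of p < alpha when there is a real effect
    alpha:      significance threshold
    """
    false_positives = alpha * prior_null          # null true, p < alpha
    true_positives = power * (1.0 - prior_null)   # real effect, p < alpha
    return false_positives / (false_positives + true_positives)


# Hypothetical scenario: 90% of treatments tested do nothing, trials have 80% power.
print(round(prob_null_given_significant(0.9, 0.8), 2))  # 0.36
```

<p>Under these assumed inputs, about a third of “significant” results would come from true nulls, even though every one of them has p below 0.05. Change the assumed prior and the answer changes, which is exactly why the p-value cannot be the probability that chance produced the result.</p>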
Statistical significance and decision-making by Simon Gates
Simon Gates
https://blogs.warwick.ac.uk/simongates/entry/statistical_significance_and/
2016-11-03T15:58:04Z
2016-11-03T15:57:06Z
<p>One of the defences of the use of traditional “null hypothesis significance testing” (NHST) in clinical trials is that, at some point, it is necessary to make a decision about whether a treatment should be used, and “statistical significance” gives us a way of doing that. I hear versions of this argument on a regular basis.</p>
<p>But the argument has always seemed to me to be ridiculous. Even if significance tests could tell you that the null hypothesis was wrong (they can’t), that doesn’t give you any basis for a sensible decision. A null hypothesis being wrong doesn’t tell you whether the treatment has a big enough effect to be worth implementing, and it takes no account of other important things, like cost-effectiveness, safety, feasibility or patient acceptability. Not a good basis for what are potentially life and death decisions.</p>
<p>But don’t listen to me: listen to The American Statistical Association. Their Statement on Statistical Significance and P-Values from earlier this year addresses exactly this point. The third of their principles is:</p>
<p><strong>“Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.”</strong></p>
<p>Pretty unambiguous, I think.</p>
“Something is rotten in the state of Denmark” by Simon Gates
Simon Gates
https://blogs.warwick.ac.uk/simongates/entry/8220something_is_rotten/
2016-11-03T15:54:44Z
2016-10-12T11:02:32Z
<p>The <span class="caps">DANISH</span> trial (in which, pleasingly, the D stands for “Danish”, and it was conducted in Denmark too) evaluated the use of Implantable Cardioverter Defibrillators (ICDs) in patients with heart failure that was not due to ischaemic heart disease. The idea of the intervention is that it can automatically restart the heart in the event of a sudden cardiac arrest – so it might help these patients, who are at increased risk of their heart stopping suddenly (obviously there is a lot more clinical detail to this).</p>
<p>The trial recruited 1116 patients and found that the primary outcome (death from any cause) occurred in 120/556 (21.6%) in the <span class="caps">ICD</span> group and 131/560 (23.4%) in control; a hazard ratio of 0.87, 95% <span class="caps">CI</span> 0.68, 1.12. The conclusion was (from the abstract):</p>
<p>“prophylactic <span class="caps">ICD</span> implantation … was not associated with a significantly lower long-term rate of death from any cause than was usual clinical care”;</p>
<p>and from the end of the paper:</p>
<p>“prophylactic <span class="caps">ICD</span> implantation … was not found to reduce longterm mortality.”</p>
<p>Note, in passing, the subtle change from “no significant difference” in the abstract, which at least has a chance of being interpreted as a statement about statistics, to “not found to reduce mortality” – a statement about the clinical effects. Of course the result doesn’t mean that, but the error is so common as to be completely invisible.</p>
<p>Reporting of the trial mostly put it across as showing no survival improvement, for example:<br />
<a href="https://healthmanagement.org/c/cardio/news/danish-trial-icds-in-non-ischaemic-heart-failure">https://healthmanagement.org/c/cardio/news/danish-trial-icds-in-non-ischaemic-heart-failure</a><br />
<a href="http://www.medscape.com/viewarticle/868065">http://www.medscape.com/viewarticle/868065</a><br />
<a href="http://www.tctmd.com/show.aspx?id=136105">http://www.tctmd.com/show.aspx?id=136105</a></p>
<p>The main issue in this trial, however, was that the <span class="caps">ICD</span> intervention <span class="caps">DID</span> reduce sudden cardiac death, which is what the intervention is supposed to do: 24/556 (4.3%) in the <span class="caps">ICD</span> group and 46/560 (8.2%) in control, hazard ratio 0.50 (0.31, 0.82). All cardiovascular deaths (sudden and non-sudden) were also reduced in the <span class="caps">ICD</span> group, but not by so much: <span class="caps">HR</span> 0.77 (0.57, 1.05). You might expect a result like this if the <span class="caps">ICD</span> reduced sudden cardiac deaths, but in addition to this both groups have similar risk of non-sudden cardiac death. When all deaths are counted (including cardiac and other causes), the difference in the outcome that the intervention can affect starts getting swamped by outcomes that it doesn’t reduce. The sudden cardiac deaths make up a small proportion of the total, so the overall difference between the groups is dominated by deaths that weren’t likely to differ between the groups, and the difference in all-cause mortality is much smaller (and “non-significant”). So all of the results seem consistent with the intervention reducing the thing it is intended to reduce, by quite a lot, but there also being a lot of deaths due to other causes that aren’t affected by the intervention. To get my usual point in, if Bayesian methods were used, you would find a substantially greater probability of benefit for the intervention for cardiovascular death and all-cause mortality.</p>
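<p>To give a feel for what a “probability of benefit” might look like here, this is a back-of-envelope sketch and not an analysis from the trial. It assumes a flat prior on the log hazard ratio and a normal approximation, so that the posterior is centred on the reported log HR with a standard error recovered from the reported 95% interval. The function name is made up for illustration; a real Bayesian analysis would use the trial data and a considered prior.</p>

```python
from math import erf, log, sqrt


def prob_hr_below_one(hr, ci_low, ci_high):
    """Approximate P(HR < 1) from a reported hazard ratio and 95% interval.

    Assumes (for illustration only) a flat prior and a normal posterior on
    the log hazard ratio scale.
    """
    # Standard error on the log scale, recovered from the width of the 95% interval.
    se = (log(ci_high) - log(ci_low)) / (2 * 1.96)
    # Distance from the estimate to "no effect" (log HR = 0), in standard errors.
    z = (0.0 - log(hr)) / se
    # Standard normal cumulative distribution function evaluated at z.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Figures as reported in the DANISH paper and quoted above.
print(prob_hr_below_one(0.87, 0.68, 1.12))  # all-cause death
print(prob_hr_below_one(0.77, 0.57, 1.05))  # cardiovascular death
print(prob_hr_below_one(0.50, 0.31, 0.82))  # sudden cardiac death
```

<p>Under these assumptions the probability that the hazard ratio is below 1 comes out at about 0.86 for all-cause death and about 0.95 for cardiovascular death, both of which were “non-significant”, and above 0.99 for sudden cardiac death.</p>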
<p>All-cause death was chosen as the primary outcome, and following convention, the conclusions are based on this. But the conclusion is sensitive to the choice of primary outcome: if sudden cardiac death had been the primary outcome, the trial would have been regarded as “positive”.</p>
<p>So, finally, to get around to the general issues. It is the convention in trials to nominate a single “primary outcome”, which is used for calculating a target sample size and for drawing the main conclusions of the study. Usually this comes down to saying there was benefit (“positive trial”) if the result gets a p-value of less than 0.05, and not if the p-value exceeds 0.05 (“negative trial”). The expectation is that a single primary outcome will be nominated (sometimes you can get away with two), but that means that the conclusions of the trial will be sensitive to this choice. I think the reason for having a single primary outcome stems from concerns over type I errors if lots of outcomes are analysed. You could then claim a “positive” trial and treatment effectiveness if any of them turned out “significant” – though obviously restricting yourself to a single primary outcome is a pretty blunt instrument for addressing multiple analysis issues.</p>
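<p>The type I error concern is easy to put numbers on. This is a minimal sketch that assumes the outcomes are independent and that every null hypothesis is true; real trial outcomes are usually correlated, so treat the figures as an upper-end illustration, not a description of any particular trial.</p>

```python
def prob_any_false_positive(n_outcomes, alpha=0.05):
    """Chance that at least one of n independent true-null tests gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


for k in (1, 2, 5, 10):
    print(k, round(prob_any_false_positive(k), 3))
# 1 0.05
# 2 0.098
# 5 0.226
# 10 0.401
```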
<p>There are lots of situations where it isn’t clear that a single outcome is sufficient for drawing conclusions from a trial, as in <span class="caps">DANISH</span>: the intervention should help by reducing sudden cardiac death, but that won’t be any help if it increases deaths for other reasons – so both sudden cardiac deaths and overall deaths are important. Good interpretation isn’t helped by the conventions (=bad habits) of equating “statistical significance” with clinical importance, and labelling the treatment as effective or not based on a single primary outcome.</p>
<p>Reference for <span class="caps">DANISH</span> trial: N Engl J Med 2016; 375:1221-1230, September 29, 2016<br />
<a href="http://www.nejm.org/doi/full/10.1056/NEJMoa1608029">http://www.nejm.org/doi/full/10.1056/NEJMoa1608029</a></p>
Classical statistics revisited by Simon Gates
Simon Gates
https://blogs.warwick.ac.uk/simongates/entry/classical_statistics_revisited/
2016-10-02T13:06:15Z
2016-10-02T09:04:47Z
<p><img src="/images/simongates/2016/10/02/oppy.jpg?maxWidth=500" alt="oppy.jpg" align="left" border="0" /></p>
<p>I’ve written before about the use of the term “classical” to refer to traditional frequentist statistics. I recently found that E.T. Jaynes had covered this ground over 30 years ago. In “The Intuitive Inadequacy of Classical Statistics” [1] he writes:</p>
<p> “What variety of statistics is meant by classical? J.R. Oppenheimer held that in science the word “classical” has a special meaning: “[…] it means “wrong”. That is, the classical theory is the one which is wrong, but which was held yesterday to be right.” </p>
<p>“… in other fields, “classical” carries the opposite connotations of “having great and timeless merit.” Classical music, sculpture and architecture are the kind I like.”</p>
<p> Jaynes follows convention, and Oppenheimer, in the article and means traditional stats by “classical”. I guess the Oppenheimer meaning should be understood more generally.</p>
<p>[1] Epistemologia VII (1984) Special Issue. Probability, Statistics and Inductive Logic pp 43-74</p>
Radio 4 does statistical significance by Simon Gates
Simon Gates
https://blogs.warwick.ac.uk/simongates/entry/radio_4_does/
2016-09-23T15:21:17Z
2016-09-23T15:21:17Z
<p>There was an item on “Today” on Radio 4 on 22 September about Family Drug and Alcohol Courts – which essentially are a different type of court system for dealing with issues about the care of children in families affected by drugs and alcohol. I know nothing about the topic, but it seems they offer a much more supportive approach and are claimed to be more successful at keeping parents off drugs and alcohol and reducing disruption to family life.</p>
<p>This item featured an interview with Mary Ryan, one of the authors of a new report comparing the effectiveness of Family Drug and Alcohol Courts with the standard system, in terms of keeping children with their parents and keeping parents off drugs and alcohol. Twice she said that differences they found were “statistically significant”, emphasising the “statistically”, and the phrase was also repeated by the Radio 4 presenter.</p>
<p>I would be pretty confident that the presenter, almost all of the audience, and very possibly Mary Ryan, have no idea what the technical meaning of “statistically significant” is. But the words have everyday meanings that we understand, and when put together they sound as though a result must be important, impressive and reliable. It’s “significant” – that means it’s important, right? And it’s not just ordinary significance, but “statistical” significance – that means that it’s backed up by statistics, which is science, so we can be sure it’s true.</p>
<p>I don’t know for sure, but I would guess that this is the sort of understanding that most people would take from a discussion on Radio 4 of “statistically significant” results. It’s a problem of using familiar words to refer to specific technical concepts; people can understand the words without understanding the concept.</p>
<p>Just after writing this I came across this blog post from Alex Etz which confirms what I thought, with numbers and everything:<br />
<a href="https://alexanderetz.com/2015/08/03/the-general-public-has-no-idea-what-statistically-significant-means/">https://alexanderetz.com/2015/08/03/the-general-public-has-no-idea-what-statistically-significant-means/</a></p>
Feel the Significance by Simon Gates
Simon Gates
https://blogs.warwick.ac.uk/simongates/entry/feel_the_significance/
2016-09-22T14:02:27Z
2016-09-22T14:02:27Z
<p>Pleasantly mangled interpretation of p-values that I came across recently:<br />
(STT is Student t-test and <span class="caps">WTT</span> is Wilcoxon t-test)</p>
<p>“The two-tailed z-tests produced calculated p-values of &lt; 1.0 × 10<sup>−6</sup> for <span class="caps">STT</span> and<br />
<span class="caps">WTT</span> at α = 0.05. As the calculated p-values are much less than α, the Null Hypothesis is rejected which therefore proves that there is a significant difference between the two groups, i.e. low and high risk.”</p>
<p>From: Batty CA, et al (2015) Use of the Analysis of the Volatile Faecal Metabolome in Screening for Colorectal Cancer. PLoS <span class="caps">ONE</span> 10(6): e0130301. doi:10.1371/journal.pone.0130301</p>
The Fragility Index for clinical trials by Simon Gates
Simon Gates
https://blogs.warwick.ac.uk/simongates/entry/the_fragility_index/
2016-06-24T09:33:09Z
2016-06-24T09:33:09Z
<p>Disclaimer: The tone of this post may have been affected by the results of the British EU referendum.</p>
<p>There has been considerable chat and Twittering about the “fragility index” so I thought I’d take a look. The basic idea is this: researchers get excited about “statistically significant” (p &lt; 0.05) results, the standard belief being that if you’ve found “significance” then you have found a real effect. (This is of course wrong, for lots of reasons.) But some “significant” results are more reliable than others. For example, if you have a small number of events in your trial, it would only require a few patients to have had different outcomes to tip a “significant” result into “non-significance”. So it would be useful to have a measure of the robustness of statistically significant results, so that readers will get a sense of how reliable they are. The Fragility Index (FI) aims to provide this. It is calculated as the number of patients whose outcomes would have had to be different in order to render the result “non-significant” (p &gt; 0.05). So if a trial had 5/100 with the main outcome in one group and 18/100 in the other, the p-value would be 0.007 (pretty significant, huh?). The fragility index would be 3 (according to the handy online calculator <a href="http://www.fragilityindex.com">www.fragilityindex.com</a>, which will calculate your p-value to 15 decimal places): only three of the intervention group non-events would need to have been events to raise the p-value above 0.05.</p>
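<p>For anyone who wants to see the mechanics, here is a minimal sketch of the calculation. It assumes a two-sided Fisher exact test and the usual recipe of converting non-events to events one at a time in the group with fewer events until p reaches 0.05. The function names are my own and this is not the code behind the online calculator.</p>

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, events = a + b, c + d, a + c

    def weight(k):
        # Unnormalised hypergeometric probability of k events in row 1 (exact integers).
        return comb(row1, k) * comb(row2, events - k)

    observed = weight(a)
    k_min, k_max = max(0, events - row2), min(events, row1)
    # Add up every possible table that is no more probable than the observed one.
    extreme = sum(w for w in map(weight, range(k_min, k_max + 1)) if w <= observed)
    return extreme / comb(row1 + row2, events)


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of non-event-to-event flips needed to reach p >= alpha.

    Flips are made in the group that starts with fewer events.
    Returns 0 if the result is not "significant" to begin with.
    """
    if events_a > events_b:
        # Make group "a" the one with fewer events.
        events_a, n_a, events_b, n_b = events_b, n_b, events_a, n_a
    flips = 0
    while events_a < n_a and fisher_two_sided(
        events_a, n_a - events_a, events_b, n_b - events_b
    ) < alpha:
        events_a += 1  # one patient switches from non-event to event
        flips += 1
    return flips


print(round(fisher_two_sided(5, 95, 18, 82), 3))  # 0.007
print(fragility_index(5, 100, 18, 100))           # 3
```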
<p>There’s a paper introducing this idea, from 2014: <br />
Walsh M et al. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. J Clin Epidemiol. 2014 Jun;67(6):622-8. doi: 10.1016/j.jclinepi.2013.10.019. Epub 2014 Feb 5.</p>
<p>I think there are good and bad aspects to this. On the positive side, it’s good that people are thinking about the reliability of “significant” results and acknowledging that just achieving significance doesn’t mean that you’ve found anything important. But to me the Fragility Index doesn’t get you much further forward. If you find a low Fragility Index, what do you do with that information? We have always known that significance when there are few events is unreliable. The problem is really judging that there is a qualitative difference between results that are “significant” and “non-significant”, a zombie myth that the Fragility Index doesn’t do anything to dispel. The justification is that judging results by “significance” is an ingrained habit that isn’t going to go away in a hurry, so the FI will highlight unreliable results and help people to avoid mistakes in interpretation. I have some sympathy with that view, but really, the problem is with the use of significance testing, and we should be promoting things that will help us to move away from this, rather than introducing new procedures that seem to validate it.</p>
<p>There are some things in the paper that I really didn’t like, for example: “The concept of a threshold P-value to determine statistical significance aids our interpretation of trial results.” Really? How exactly does it do that? It just creates an artificial dichotomy based on a nonsensical criterion. The paper tries to explain in the next sentence: “It allows us to distill the complexities of probability theory into a threshold value that informs whether a true difference likely exists”. I have no idea what the first part of that means, but the second part is just dead wrong. No p-value will ever tell you “whether a true difference likely exists” because they are calculated on the assumption that the difference is zero. This is just perpetuating one of the common and disastrous misinterpretations of p-values, and it is pretty surprising that this set of authors gets it wrong. Or maybe it isn’t, considering that almost everyone else does.</p>
NEJM letter and cardiac arrest trial by Simon Gates
Simon Gates
https://blogs.warwick.ac.uk/simongates/entry/nejm_letter_and/
2016-04-14T12:04:20Z
2016-04-14T11:41:47Z
<p>I recently had a letter in the New England Journal of Medicine, about a trial they had published that compared continuous versus interrupted chest compressions during resuscitation after cardiac arrest. Interrupted compressions are standard care - the interruptions are for ventilations to oxygenate the blood, prior to resuming chest compressions to keep it circulating. The issue was that the result of the trial was 0.7 percentage points better survival in the interrupted-compression group (difference, continuous minus interrupted: -0.7 percentage points, 95% CI from -1.5 to 0.1). So the data are suggesting a probable benefit to interrupted compressions. However, on Twitter the NEJM announced this as “no difference”, no doubt because the difference was not “statistically significant”. So I wrote pointing out that this wasn’t a good interpretation, and the dichotomy into “significant” and “non-significant” is pretty unhelpful in situations where the results are close to “significance”. Bayesian methods have a huge advantage here, in that they can actually quantify the probability of benefit. An 80% probability that the treatment is beneficial is a lot more useful than “non-significance”, and might lead to very different actions. </p>
<p> The letter was published along with a very brief reply from the authors (they were probably constrained, as I was in the original letter, by a tiny word limit): <em>“Bayesian analyses of trials are said to offer some advantages over traditional frequentist analyses. A limitation of the former is that different people have different prior beliefs about the effect of treatment. Alternative interpretations of our results offered by others show that there was not widespread clinical consensus on these prior beliefs. We are not responsible for how the trial results were interpreted on Twitter.”</em> </p>
<p>Taking the last point first: no, the authors did not write the Twitter post. But they also did not object to it. I'm not accusing them of making the error that non-significance = no difference, but it is so common that it usually - as here - passes without comment. But it's just wrong. </p>
<p> Their initial point about priors illustrates a common view, that Bayesian analysis is about incorporating individual prior beliefs into the analysis. While you can do this, it is neither necessary nor a primary aim. As Andrew Gelman has said (and I have repeated before): prior information, not prior beliefs. We want to base a prior on the information that we have at the start of the trial, and if that is no information, then that’s fine. However, we almost always do have some information on what the treatment effect might plausibly be. For example, it’s very unusual to find an odds ratio of 10 in any trial, so an appropriate prior would make effects of this (implausible) size unlikely. More importantly, in this case, getting too hung up on priors is a bit irrelevant, because the trial was so huge (over 20,000 participants) that the data will completely swamp any reasonable prior. </p>
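<p>The “swamping” point can be illustrated with a small sketch. It uses a simple conjugate Beta prior for a single survival proportion, with three priors and event counts that are entirely made up for illustration (they are not the trial's data), and compares how far apart the posterior means end up in a small trial and in a large one.</p>

```python
def beta_posterior_mean(prior_a, prior_b, events, n):
    """Posterior mean of a proportion under a Beta(prior_a, prior_b) prior
    after observing `events` successes in `n` patients."""
    return (prior_a + events) / (prior_a + prior_b + n)


# Three illustrative priors (hypothetical choices, not from any real trial).
priors = {
    "flat, Beta(1, 1)": (1, 1),
    "centred on 10%, Beta(2, 18)": (2, 18),
    "centred on 30%, Beta(30, 70)": (30, 70),
}


def spread_of_posterior_means(events, n):
    """Largest minus smallest posterior mean across the priors above."""
    means = [beta_posterior_mean(a, b, events, n) for a, b in priors.values()]
    return max(means) - min(means)


# Hypothetical data: 10% survival in a small and in a large trial.
small_trial_spread = spread_of_posterior_means(10, 100)
large_trial_spread = spread_of_posterior_means(1000, 10000)

print(f"spread with n = 100:    {small_trial_spread:.3f}")
print(f"spread with n = 10,000: {large_trial_spread:.3f}")
```

<p>With 100 patients the choice of prior moves the answer a lot; with 10,000 the three posteriors are practically indistinguishable, which is the sense in which a huge trial makes arguments about priors a bit irrelevant.</p>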
<p> It isn’t possible to re-create the analysis from the information in the paper, as it was a cluster-randomised trial with crossover, which needs to be taken into account. Just using the outcome data for survival to discharge in a quick and dirty Bayesian analysis, though, gives a 95% credible interval for the odds ratio (continuous versus interrupted) of something like 0.84 to 1.00, with a probability of the odds ratio being less than 1 of about 98%. That probably isn’t too far away from the correct result, and suggests pretty strongly that survival may be a bit worse in the continuous compression group. “No difference” just doesn’t seem like an adequate summary to me. </p>
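<p>For anyone wanting to try something similar, here is a rough sketch of one way a quick and dirty analysis like this could be done; it is not necessarily the calculation I used. It assumes independent flat Beta(1, 1) priors on the two survival proportions and ignores the clustering and crossover. The counts (1129 of 12,613 survivors with continuous compressions, 1072 of 11,035 with interrupted) are not given in this post and are assumed here, so check them against the trial report before relying on them.</p>

```python
import random


def odds(p):
    return p / (1 - p)


def posterior_odds_ratio(events_1, n_1, events_2, n_2, draws=20000, seed=1):
    """Monte Carlo summary of the posterior odds ratio (group 1 vs group 2),
    using independent flat Beta(1, 1) priors on each proportion."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(draws):
        p1 = rng.betavariate(1 + events_1, 1 + n_1 - events_1)
        p2 = rng.betavariate(1 + events_2, 1 + n_2 - events_2)
        ratios.append(odds(p1) / odds(p2))
    ratios.sort()
    lower = ratios[int(0.025 * draws)]   # approximate 2.5th percentile
    upper = ratios[int(0.975 * draws)]   # approximate 97.5th percentile
    prob_below_1 = sum(r < 1 for r in ratios) / draws
    return lower, upper, prob_below_1


# Assumed counts: survival to discharge, continuous vs interrupted compressions.
lower, upper, prob_below_1 = posterior_odds_ratio(1129, 12613, 1072, 11035)
print(f"95% credible interval for the odds ratio: {lower:.2f} to {upper:.2f}")
print(f"P(odds ratio < 1) = {prob_below_1:.2f}")
```

<p>Under those assumptions the output should land close to the interval and probability quoted above, though the exact digits depend on the random seed and number of draws.</p>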
<p>My letter and the authors’ reply are here: http://www.nejm.org/doi/full/10.1056/NEJMc1600144 </p>
<p>The original trial report is here: Nichol G, Leroux B, Wang H, et al. Trial of continuous or interrupted chest compressions during CPR. N Engl J Med 2015;373:2203-2214 http://www.nejm.org/doi/full/10.1056/NEJMoa1509139</p>
Why do they say that? by Simon Gates
Simon Gates
https://blogs.warwick.ac.uk/simongates/entry/why_do_they/
2015-12-10T13:17:49Z
2015-12-09T15:27:02Z
<p>A thing I've heard several times is that Bayesian methods might be advantageous for Phase 2 trials but not for Phase 3. I've struggled to understand why people would think that. To me, the advantage of Bayesian methods comes in the fact that the methods make sense, answer relevant questions and give understandable answers, which seem just as important in Phase 3 trials as in Phase 2.</p>
<p>One of my colleagues gave me his explanation, which I will paraphrase. He made two points:</p>
<p><em>1. Decision-making processes are different after Phase 2 and Phase 3 trials; following Phase 2, decisions about whether to proceed further are made by researchers or research funders, but after Phase 3, decisions (about use of therapies presumably) are taken by "society" in the form of regulators or healthcare providers. This makes the Bayesian approach harder as it is harder to formulate a sensible prior (for Phase 3 I think he means).</em></p>
<p><em>2. In Phase 3 trials sample sizes are larger so the prior is almost always swamped by the data, so Bayesian methods don't add anything.</em></p>
<p>My answer to point 1: Bayesian methods are about more than priors. I think this criticism comes from the (limited and in my view somewhat misguided) view of priors as a personal belief. That is one way of specifying them but not the most useful way. As Andrew Gelman has said, prior INFORMATION not prior BELIEF. And you can probably specify information in pretty much the same way for both Phase 2 and Phase 3 trials.</p>
<p>My answer to point 2: Bayesian methods aren't just about including prior information in the analysis (though they are great for doing that if you want to). I'll reiterate my reasons for preferring them that I gave earlier - the methods make sense, answer relevant questions and give understandable answers. Why would you want to use a method that doesn't answer the question and nobody understands? Also, if you DO have good prior information, you can reach an answer more quickly by incorporating that in the analysis - which we kind of do by doing trials and then combining them with others in meta-analyses; but doing it the Bayesian way would be neater and more efficient.</p>