Dent & Raftery analysed how many trials funded by HTA showed treatment benefit, showed harm, or were inconclusive (Trials 2011; 12:109). They found that 24% of trials showed a "significant" result (19% in favour of the new intervention and 5% in favour of control). How many of these results are likely to be correct?
Two things determine how many significant results we see: the power of the trials and the proportion of interventions that are really effective. If all interventions are effective and every trial has 90% power, then 90% of trials will give a "significant" result. If no interventions are effective, we expect 5% of results to be significant (the conventional significance level); these are all false positives. See the graph below.
Proportion of truly effective interventions in a field (i.e. the “population” of interventions that could be tested) against the proportion of “significant” effects that will be found, for different values of power. Except when few interventions are effective or power is very high, the proportion of truly effective interventions exceeds the proportion found to be significant.
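The relationship plotted in the graph can be sketched in a few lines of Python. This is a minimal illustration, assuming a 5% significance level and the same power for every trial; the function name `proportion_significant` and its parameters are mine, not from the paper.

```python
def proportion_significant(p_effective, power, alpha=0.05):
    """Expected share of trials giving a 'significant' result.

    p_effective: proportion of interventions that are truly effective
    power:       probability a trial detects a truly effective intervention
    alpha:       false positive rate (assumed 5%, as in the text)
    """
    true_positives = p_effective * power
    false_positives = (1 - p_effective) * alpha
    return true_positives + false_positives


# One row per power level; columns are 0%, 20%, ..., 100% truly effective.
for power in (0.9, 0.7, 0.5, 0.25):
    row = [round(proportion_significant(p / 10, power), 3) for p in range(0, 11, 2)]
    print(f"power={power:.2f}", row)
```

Each row runs from 5% (nothing works, all significant results are false positives) up to the power itself (everything works), which matches the two extreme cases described above.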
Reading across from 24% significant results, the proportion of effective interventions in the population could be anywhere from about 22% (if power is 90%) to 100% (if power is about 25% or less). Power probably isn't 90%: most trials are designed to have 80-90% power, usually for an optimistic effect size. Given that many effect sizes are smaller than anticipated and many trials recruit fewer participants than expected, power might typically be 60-70%. This suggests that around 30% of the interventions evaluated by HTA trials are truly effective.
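"Reading across" the graph amounts to inverting the relationship between the proportion effective and the proportion significant. The sketch below does that arithmetic, again assuming a 5% significance level and equal power across trials; `proportion_effective` is a hypothetical helper name, and the power values are just the ones discussed here.

```python
def proportion_effective(p_significant, power, alpha=0.05):
    """Solve p_significant = p*power + (1 - p)*alpha for p.

    The result is clipped to [0, 1], since values outside that range
    mean the observed rate cannot arise at this power and alpha.
    """
    if power <= alpha:
        raise ValueError("power must exceed alpha for the inversion to be defined")
    p = (p_significant - alpha) / (power - alpha)
    return min(max(p, 0.0), 1.0)


# 24% of trials were significant; what does that imply at each power level?
for power in (0.9, 0.8, 0.7, 0.6, 0.24):
    share = proportion_effective(0.24, power)
    print(f"power={power:.0%}: about {share:.0%} truly effective")
```

With power in the 60-70% range the implied proportion of truly effective interventions comes out at roughly 30%, in line with the estimate above.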
How often do significance tests get it right? If power is 70%, 86.86% of significant differences reflect truly effective interventions, meaning that 13.14% are false positives from ineffective interventions. As power decreases, so does the proportion of significant differences that reflect really effective interventions: the positive predictive value (PPV) of a significant result gets worse.
Proportion of significant effects that are truly effective (PPV) for different values of power.
If only a small proportion of interventions are really effective, many significant effects will be false positives. Low power makes this worse.
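The positive predictive value can be computed directly as true positives divided by all significant results. The following is a rough sketch under the same assumptions as before (5% significance level, equal power across trials); the name `ppv` and the grid of values are illustrative only, and the exact percentages depend on the proportion of effective interventions assumed.

```python
def ppv(p_effective, power, alpha=0.05):
    """Share of 'significant' results that come from truly effective interventions."""
    true_positives = p_effective * power
    false_positives = (1 - p_effective) * alpha
    return true_positives / (true_positives + false_positives)


# PPV falls as power falls, and falls as fewer interventions are truly effective.
for p_effective in (0.1, 0.3, 0.5):
    for power in (0.9, 0.7, 0.5, 0.3):
        print(f"effective={p_effective:.0%} power={power:.0%} "
              f"PPV={ppv(p_effective, power):.1%}")
```

Reading down the output for a fixed proportion effective shows the effect of falling power; comparing blocks shows the effect of a smaller proportion of effective interventions.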