July 06, 2016

Bibliometrics of Cluster Inference

Writing about web page http://www.pnas.org/content/early/2016/06/27/1602413113.full

In Eklund et al., Cluster Failure: Why fMRI inferences for spatial extent have inflated false-positive rates, we report on a massive evaluation of statistical methods for task fMRI using resting state as a source of null data. We found very poor performance with one variant of cluster size inference, with a cluster defining threshold (CDT) of P=0.01 giving familywise error (FWE) rates of 50% or more when 5% was expected; CDT of P=0.001 was much better, sometimes within the confidence bounds of our evaluation but almost always above 5% (i.e. suggesting a bias). We subsequently delve into the reasons for these inaccuracies, finding heavy-tailed spatial correlations and spatially-varying smoothness to be the likely suspects. In contrast, non-parametric permutation performs ‘spot on’, only having some inaccuracies in one-sample t-tests, likely due to asymmetric distribution of the errors.

I’ve worked on comparisons of parametric and nonparametric inference methods for neuroimaging essentially my entire career (see refs in Eklund et al.). I’m especially pleased with this work for (1) the use of a massive library of resting-state fMRI for null realisations, finally having samples that reflect the complex spatial and temporal dependence structure of real data, and (2) getting to the bottom of why the parametric methods don’t work.

However, there is one number I regret: 40,000. In trying to refer to the importance of the fMRI discipline, we used an estimate of the entire fMRI literature as the number of studies impinged by our findings. In our defense, we found problems with cluster size inference in general (severe for P=0.01 CDT, biased for P=0.001), the dominant inference method, suggesting the majority of the literature was affected. The number in the impact statement, however, has been picked up by the popular press and fed a small twitterstorm. Hence, I feel it’s my duty to make at least a rough estimate of “How many articles does our work affect?”. I’m not a bibliometrician, and this is really a rough-and-ready exercise, but it hopefully gives a sense of the order of magnitude of the problem.

The analysis code (in Matlab) is laid out below, but here is the skinny: Based on some reasonable probabilistic computations, but perhaps fragile samples of the literature, I estimate about 15,000 papers use cluster size inference with correction for multiple testing; of these, around 3,500 use a CDT of P=0.01. 3,500 is about 9% of the entire literature, or perhaps more usefully, 11% of papers containing original data. (Of course some of these 15,000 or 3,500 might use nonparametric inference, but it’s unfortunately rare for fMRI—in contrast, it’s the default inference tool for structural VBM/DTI analyses in FSL).

I frankly thought this number would be higher, but didn’t realise the large proportion of studies that never used any sort of multiple testing correction. (Can’t have inflated corrected significances if you don’t correct!). These calculations suggest 13,000 papers used no multiple testing correction. Of course some of these may be using regions of interest or sub-volume analyses, but it’s a scant few (i.e. clinical trial style outcome) that have absolutely no multiplicity at all. Our paper isn’t directly about this group, but for publications that used the folk multiple testing correction, P<0.001 & k>10, our paper shows this approach has familywise error rates well in excess of 50%.

So, are we saying 3,500 papers are “wrong”? It depends. Our results suggest CDT P=0.01 results have inflated P-values, but each study must be examined… if the effects are really strong, it likely doesn’t matter if the P-values are biased, and the scientific inference will remain unchanged. But if the effects are really weak, then the results might indeed be consistent with noise. And, what about those 13,000 papers with no correction, especially common in the earlier literature? No, they shouldn’t be discarded out of hand either, but a particularly jaded eye is needed for those works, especially when comparing them to new references with improved methodological standards.

My take homes from this exercise have been:
  • No matter what method you’re using, if you go to town on a P-value on the razor’s edge of P=0.05000, you lean heavily on the assumptions of your method, and any perturbation of the data (or slight failure of the assumptions) would likely give you a result on the other side of the boundary. (This is a truism for all of science, but particularly for neuroimaging where we invariably use hard thresholds.)
  • Meta-analysis is an essential tool to collate a population of studies, and can be used in this very setting when individual results have questionable reliability. In an ideal world, all studies, good and bad, would be published with full data sharing (see next 2 points), each clearly marked with their varying strengths of evidence (no correction < FDR voxel-wise < FDR cluster-wise < FWE cluster-wise < FWE voxel-wise). This rich pool of data, with no file drawer effects, could then be distilled with meta-analysis to see what effects stand up.
  • Complete reporting of results, i.e. filing of statistical maps in public repositories, must happen! If all published studies’ T maps were available, we could revisit each analysis (approximately at least). The discipline of neuroimaging is embarrassingly behind genetics and bioinformatics, where SNP-by-SNP or gene-by-gene results (effect size, T-value, P-values) are shared by default. This is not a “data sharing” issue, this is a transparency of results issue… that we’ve gone 25 years showing only bitmap JPEG/TIFF’s of rendered images or tables of peaks is shameful. I’m currently in discussions with journal editors to press for sharing of full, unthresholded statistic maps to become standard in neuroimaging.
  • Data sharing must also come. With the actual full data, we could revisit each paper’s analysis, exactly, and, what’s more, in 5 years, revisit again with even better methods, or for more insightful (e.g. predictive) analyses.


The PNAS article has now been corrected:
Nstudies=40000;   % N studies in PubMed with "fMRI" in title/abstract [1]
Pdata=0.80;       % Prop. of studies actually containing data [2]
Pcorr=0.59;       % Prop. of studies correcting for multiple testing, among data studies [3]
Pclus=0.79;       % Prop. of cluster inference studies, among corrected studies [4]
Pgte01=0.24;      % Prop. of cluster-forming thresholds 0.01 or larger, among corrected cluster inference studies [5]

% Number of studies using corrected  cluster inference (higher P)
Nstudies*Pdata*Pcorr*Pclus        % 14,915

% Number of studies using corrected cluster inference with a cluster defining threshold of P=0.01 or larger (i.e. more liberal)
Nstudies*Pdata*Pcorr*Pclus*Pgte01 %  3,579

% Number of studies with original data not using a correction for multiple testing
Nstudies*Pdata*(1-Pcorr)          % 13,120 
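
The shares of the literature quoted in the text (about 9% of all papers, about 11% of papers with original data) follow from the same five inputs. A minimal Matlab sketch of that arithmetic is below; the variable names `Ndata`, `Ncdt01`, `shareAll` and `shareData` are mine, introduced only for illustration.

```matlab
% Rough check of the percentages quoted in the text, using the same inputs.
% Ndata, Ncdt01, shareAll and shareData are illustrative names.
Nstudies = 40000;  Pdata = 0.80;  Pcorr = 0.59;  Pclus = 0.79;  Pgte01 = 0.24;

Ndata  = Nstudies*Pdata;             % papers containing original data (32,000)
Ncdt01 = Ndata*Pcorr*Pclus*Pgte01;   % corrected cluster inference, CDT P>=0.01

shareAll  = Ncdt01/Nstudies;         % share of the entire literature
shareData = Ncdt01/Ndata;            % share of papers with original data

fprintf('CDT P>=0.01: %.0f papers, %.1f%% of all, %.1f%% of data papers\n', ...
        Ncdt01, 100*shareAll, 100*shareData);
```

Because `Nstudies` cancels out of `shareData`, the 11% figure depends only on the three sampled proportions and not on the PubMed count.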


  1. 42,158 rounded down, from a PubMed search for “((fmri[title/abstract] OR functional MRI[title/abstract]) OR functional Magnetic Resonance Imaging[title/abstract])” conducted 5 July 2016.
  2. Carp, 2012, literature 2007 – 2012, same search as in [1], with additional constraints, from which a random sample of 300 was selected. Of these 300, 59 were excluded as not presenting original fMRI data, (300-59)/300=0.80.
  3. Carp, 2012, “Although a majority of studies (59%) reported the use of some variety of correction for multiple comparisons, a substantial minority did not.”
  4. Woo et al., 2014, papers Jan 2010 – Nov 2011, “fMRI” and “threshold”, 1500 papers screened, 814 included; of those 814, 607 used cluster thresholding, and 607/814=0.746 matching 75% in Fig 1. However, Fig 1 also shows that 6% of those studies had no correction. To match up with Carp’s use of corrected statistics, we thus revise this to 607/(814-0.06*814)=0.79
  5. Woo et al., data from author (below), cross-classifying 480 studies with sufficient detail to determine the cluster defining threshold. There are 35 studies with a CDT P>0.01, 80 using CDT P=0.01, giving 115/480=0.240.
    CDT      AFNI     BV    FSL    SPM   OTHERS
    _____    ____     __    ___    ___   ______

    >.01        9      5      9      8        4
    .01         9      4     44     20        3
    .005       24      6      1     48        3
    .001       13     20     11    206        5
    <.001       2      5      3     16        2
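
As a rough cross-check of notes 4 and 5, the following Matlab sketch rebuilds Pclus and Pgte01 from the counts quoted above. It uses only numbers given in this post; the variable names (`counts`, `Nincl`, `Ngte01` and so on) are my own labels, not anything from Woo et al.

```matlab
% Cross-check of notes 4 and 5, using only counts quoted in this post.
% All variable names are illustrative.

% Note 4: cluster thresholding among corrected studies (Woo et al., 2014)
Nincl   = 814;                            % papers included
Nclus   = 607;                            % of those, using cluster thresholding
Pnocorr = 0.06;                           % share with no correction (Fig 1)
Pclus   = Nclus/(Nincl - Pnocorr*Nincl);  % entered in the code as 0.79

% Note 5: rows are CDT (>.01, .01, .005, .001, <.001),
% columns are AFNI, BV, FSL, SPM, OTHERS
counts = [ 9  5  9   8  4
           9  4 44  20  3
          24  6  1  48  3
          13 20 11 206  5
           2  5  3  16  2];
Ntot   = sum(counts(:));                  % all cross-classified studies
Ngte01 = sum(sum(counts(1:2,:)));         % CDT P>.01 plus CDT P=.01
Pgte01 = Ngte01/Ntot;                     % entered in the code as 0.24

fprintf('Pclus = %.3f, Pgte01 = %.3f (%d of %d)\n', Pclus, Pgte01, Ngte01, Ntot);
```

The row sums also show where the remaining studies sit: most of the 480 use a CDT of P=0.001, dominated by SPM.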


Revised to add mention of the slice of the literature that actually does use nonparametric inference (thanks to Nathaniel Daw).

Initially failed to adjust for final comment in note #4; correction increased all the cluster numbers slightly (thanks to Alan Pickering).

Added a passage about possibility of clinical trial type studies, with truly no multiplicity.

Added take-home about meta-analysis.

Revised to reference PNAS correction. 16 August 2016.

  1. Alan Pickering

    Nice post. c. 16500 highly questionable results (even if this is out of the headline figure of 40000) shows that the message of your PNAS paper is a very important one.

    Pclus in your code is shown as 0.73 but you compute it, note 4, as 0.793. It doesn’t make any odds to these ballpark figures of course.

    06 Jul 2016, 09:42

  2. Thomas Nichols

    Alan: Thanks for catching this! The ‘6%’ adjustment was a last minute addition; I’ve revised the numbers, and now all the counts for cluster inference are slightly higher.

    06 Jul 2016, 11:12

  3. Finn Årup Nielsen

    A somewhat dated and perhaps biased answer to Pcorr is available for studies in the Brede Database: Figure 3 in “fMRI Neuroinformatics” [1] gives around a third.

    [1] http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3516/pdf/imm3516.pdf

    06 Jul 2016, 15:25

  4. Thomas Nichols

    Finn: If we could do a ‘meta-regression’ on use of corrected significance, I’d suspect (hope!) there is a strong positive trend, with it being very rare initially and then steadily climbing with time. So your 1/3 makes sense from a 2004 snapshot of the Brede database.

    06 Jul 2016, 15:50

  5. Jeff Johnson

    Very nice paper and subsequent discussions, Tom!
    I have one question though: Do you know where the “folk multiple testing correction, P<0.001 & k>10” comes from? Seems like the k>10 part might be used by people that resample to 2×2×2mm voxels, but I always thought k>5 3×3×3mm voxels was the popular method (at least in my corner of the literature). The latter would also require cluster sizes about 50% larger (135 vs. 80 mm^3).

    06 Jul 2016, 22:19

  6. Thomas Nichols

    Jeff: Good question! I have no idea where that correction comes from… Certainly the P<0.001 threshold, as a default, has been in SPM for ages, perhaps since the first version, and I won't be surprised if 10 came from SPM as well. Another motivation of a 10 voxel threshold is that it is at least 1 voxel larger than a 3x3x3 cube??

    07 Jul 2016, 06:10

  7. Cindy Lustig

    Jeff, Tom: I believe the p <.001 k = 10 threshold came from

    Forman, S.D., Cohen, J.D., Fitzgerald, M., Eddy, W.F., Mintun, M.A.,
    Noll, D.C. (1995). Improved assessment of significant activation in functional
    magnetic resonance imaging (fMRI): use of a cluster-size threshold.
    Magnetic Resonance in Medicine, 33(5), 636–47.

    See Bennet et al. 2009 SCAN for a discussion.

    - Cindy

    07 Jul 2016, 11:07

  8. Thomas Nichols

    Cindy: Thanks! That sounds right; and for others, the full Bennet ref is below.

    Bennett, C. M., Wolford, G. L., & Miller, M. B. (2009). The principled control of false positives in neuroimaging. Social Cognitive and Affective Neuroscience, 4(4), 417–22. doi:10.1093/scan/nsp053

    07 Jul 2016, 11:58

  9. Dan H

    Regarding the origin of the uncorrected p<0.001, this is the threshold used in one of the first fMRI papers, Kwong, et al, PNAS 1992. For what it's worth, two of the other initial fMRI papers: Bandettini et al MRM 1992 and Ozawa et al PNAS 1990 have no statistical thresholds at all. I don't think any of their results are now being called to question.

    07 Jul 2016, 22:53

  10. David Colquhoun

    This is a terrific piece of work. I can’t imagine why it hasn’t been done earlier. And I can’t imagine why resampling methods aren’t used more often.

    I wonder, though, whether the problem isn’t even worse than you say. If you are successful in correcting for multiple comparisons and get a true type 1 error rate of, say, 5%, we are still left with the problem that the type 1 errors rate doesn’t tell us what we want to know. What we need is clearly the false positive rate. And for P=0.047 this is at least 30% in a simple example -see for example http://rsos.royalsocietypublishing.org/content/1/3/140216

    It would be interesting to do similar calculations for your sort of analysis.

    07 Jul 2016, 23:20

  11. Thomas Nichols

    Dan: Thanks for the reference… and great point. Just because a paper doesn’t use correction doesn’t mean it’s crap… you have to examine it, and the strength of the signal. And, of course, even if one paper has weak signal, and, individually, I’d call it “not significant after multiple correction”, it may well be replicated by another study (and that would be confirmed by a meta-analysis). Just as a paper can report a fully corrected finding that just doesn’t replicate.

    The practical answer isn’t “review each study”, it’s “do meta-analysis”.

    08 Jul 2016, 06:51

  12. Thomas Nichols

    David: Good point, I haven’t seen any extension of such analyses like yours, that get at something like a positive predictive value, extended to FWE corrected P-values. Work needed!

    08 Jul 2016, 06:53

  13. David Colquhoun

    Thanks for that response. I do hope that someone does such an analysis soon.

    On a trivial matter of notation, I much prefer the term “false positive rate” (or “false discovery rate”) to “positive predictive value”, because the former are more or less self-explanatory whereas PPV is totally obscure to most people.

    08 Jul 2016, 10:13

  14. Matt

    While your paper addressed an important issue, the way it was presented was sensationalist and designed for the news headlines, and was damaging for the field.
    The suggestion that most fMRI studies may be false and should be re-done is out of touch with reality, and was irresponsible. Do you read the fMRI literature? I mean that in all seriousness, because there is no way that you can look at the collective body of fMRI work and question the validity of most studies. Of course there are false positives out there and it is critical to pay attention to proper statistical methods. However, your paper threw the entire fMRI literature under the bus, and completely ignored the huge number of empirically replicated findings. Your paper was irresponsible because it was ready made for headlines that lack nuance, and these headlines have shaken the public’s trust in fMRI research. This is sad, because the public is ultimately paying for fMRI research, which is already under funded. I really hope there are no adverse consequences (i.e., a drop in funding for fMRI), but time will tell. Did you stop to think about these potential consequences prior to publishing?
    I hope you can see that your paper was as much about your ego and personal success as it was about progress in neuroscience. I urge you to consider this before submitting future work. If you truly desire to help neuroscience research, then please consider retracting your paper (for making exaggerated claims that are not empirically supported), or at least publish a clarification indicating that the 40,000 number was meaningless, and that you truly have no idea how many published papers contain false positives.
    I hope you take these comments seriously.
    Kind regards.

    09 Jul 2016, 02:09

  15. Ben de Haas

    David C: Interesting paper! ‘And for P=0.047 this is at least 30% in a simple example’ – might be worth mentioning that this critically rests on the assumption that 90% of hypotheses tested are wrong in the first place

    09 Jul 2016, 11:07

  16. Thomas Nichols

    Matt: As soon as this was taken up by mainstream media I realised we erred in our statement of its impact and I immediately set about correcting the record.

    But to be fair, there are, as I see it, exactly two misstatements (that we are currently working with the PNAS Editor to fix): the last sentence of the Significance statement, and the first sentence of the closing sub-section (The Future of fMRI), two places where we wrongly indicated all fMRI studies were affected. The balance of the paper is, IMHO, sober, methodological statistical analyses of null data, and a detailed exploration of the sources of the inaccuracies of familywise false positive control. I believe it belongs in PNAS not because it is some ‘bomb shell’ on fMRI, but because it is a uniquely massive evaluation with realistic noise realisations (i.e. from real data instead of simulation), and for once identifies why the standard methods don’t work.

    09 Jul 2016, 11:25

  17. Yehouda Harpaz

    to matt:

    You write

    ”.. and completely ignored the huge number of empirically replicated findings. “

    It will be interesting to see at least one example of comparison between independent sets
    of data.
    If you search the internet you find things like this:
    which doesn’t fill you with confidence

    09 Jul 2016, 18:43

  18. Patrick

    The field of neuroimaging sits at an intersection of many different types of people with different types of smarts. There exists a set of neuroimaging researchers who simply do not have the background or the interest to resolve this type of technical aspect of the underlying analytic methods. Therefore, such a contribution is very valuable if for no other reason than to give some concrete notion of the limitations of these statistical threshold strategies for those who will not/cannot look into it for themselves. For that, and the clarity with which you present the findings, thanks to the authors.

    11 Jul 2016, 12:14

  19. Kevin Black

    Your paper is great. I was surprised that the Gaussian field assumptions were a major problem, because SPM (with which I’m most familiar) recognized that the raw data might not fit that model so recommended smoothing the data with a Gaussian kernel, and assumed that that would take care of it.

    My main comment is this: I believe the real problem is less with the specific statistical methods used and more with an explicit rejection of tight Type I error control. The apparent motivation for that rejection was that methods with tight Type I error control (like SnPM) didn’t generate as many publications. Ditto for other methods—before my time, Wash. U. folks used an empirical gamma distribution made from a large sample of PET blood flow images to control for Type I error, but it never caught on much elsewhere. When I first started writing neuroimaging papers in the late 1990s, we had a lot of data from a small number of nonhuman subjects, and a split sample (hypothesis-generating, hypothesis-testing) approach was recommended, but most reviewers complained about it as being too complicated.

    The only data I recall seeing that might support a principled argument to intentionally limit Type I error control was from Thirion et al 2007, doi: 10.1016/j.neuroimage.2006.11.054. Since strict Type I error control raised the rate of false negatives (in terms of group reproducibility), the idea was that a balance was appropriate. Interestingly that paper concluded that an initial threshold of Z=2.7-3.0 (p=0.0035-0.001) was about right (page 116).

    12 Jul 2016, 00:30

  20. Thomas Nichols


    The issue of Type I vs Type II error is indeed important, and we basically eschew this, considering only the accuracy of familywise Type I error control. Thank you for pointing out the Thirion et al. (2007) paper’s result! I had just looked at that paper but skipped that figure (Fig 7, on pg 115 actually) that shows for N=16, and for the characteristics of their signal, a threshold of around Z=3 indeed maximises the cluster stability. The only problem is we don’t know how that would generalise to other sample sizes, signal strengths, and spatial arrangement of signal.

    12 Jul 2016, 06:12

  21. Yolanda

    Matt: the fMRI workforce may be underfunded as you say but that’s because the fMRI research created a bubble that just waits to burst, too many centers, too many researchers, a lot with little or no expertise and some just doing anything to survive.
    Tom, great endeavor … I’ll keep questioning your results as well. I’m all for meta-analysis, nobody can’t wrap their head around all kinds of results that butt heads with each other.

    12 Jul 2016, 13:48

  22. Ben de Haas


    ‘It will be interesting to see at least one example of comparison between independent sets
    of data’ – A good starting point is here: http://www.ncbi.nlm.nih.gov/pubmed/17964252 fMRI can give you data that is highly reliable and matches exactly what you’d expect based on electrophysiology. Findings like the basic layout of cortical visual field maps or category preferring patches in IT replicate on the single participant level. This has been found thousands of times and in labs all over the world. Which of course renders the ‘replications’ as such rather dull – they don’t make paper titles or abstracts, but typically appear as a blurb in a methods section (functional ROI definition etc.). But this just reflects that by now we consider those findings robust enough to treat them as facts about the human brain.

    We currently have a paper under review testing the re-test reliability of population receptive field estimates on a voxel-by-voxel level. For some parameters this within-participant reliability is >.9(!) fMRI as such can be super robust or flaky, it very much depends what you do with it. A hammer as such is neither ‘a good method’, nor a bad one…. On second thought, maybe we should hype the sh*t out of our paper and shout sweeping over-generalisations all over the media: ‘fMRI robust after all!’ =P

    13 Jul 2016, 11:10

  23. Cindy Lustig

    Tom – do you know if the AFNI bug w/re AlphaSim also extends to the REST version of AlphaSim?

    Yehouda/Yolanda – while Matt’s statements are quite overblown, so are yours. Besides meta-analyses, another common (though perhaps not as common as it should be…) approach is to use a priori ROIs defined on one dataset – either from your own lab or from a paper in the literature – and then apply them to your current, completely separate dataset. You still of course have to be thoughtful about interpretation – there’s no analysis shortcut that gets you out of that.

    14 Jul 2016, 02:17

  24. Thomas Nichols

    Cindy: We contacted the REST authors last week, and they seemed to suggest it is a problem. Anyway, it’s important to note that fixing the bug only marginally improved FWE. The bigger issue is the (unsupported) assumption of the Gaussian form for the spatial autocorrelation and homogeneous smoothness in space.

    The latest 3dClusterSim changes address this, using a more flexible and spatially varying smoothing kernel, and give much better results, though not as valid/stable as permutation. See the poster by Bob Cox at OHBM this year: https://ww5.aievolution.com/hbm1601/files/content/abstracts/40760/2082_Cox.pdf

    14 Jul 2016, 05:51

  25. Yehouda Harpaz

    to Ben de Haas:
    The reference that you give doesn’t contain anything that even resemble comparing data with data, and the
    words “replication” or “reproduction” do not appear in it at all. Another demonstration of the contempt for the concept
    (and bogus reference technique).

    Re-testing within the same individual is reasonably reliable, but doesn’t give generality. For that you need
    replication across individuals, and you don’t have that except for the low-resolution features.

    If fMRI studies were making claims only about low-resolution features, or only about within subjects features, you could
    (quite easily) defend it. But they make much detailed claims which are supposed to be general, and the data
    does not support that, because it is not reproducible.

    We hear a lot about how difficult it is to replicate fMRI studies, but the main reason that
    it doesn’t happen is that everybody knows that you cannot replicate fMRI results except
    within subjects or in uninteresting low-resolution, so nobody even tries anymore. Solving
    the technical issues is important, but it is not going to solve the attitude to replication.

    14 Jul 2016, 16:50

  26. Yehouda Harpaz

    To Cindy:
    If you could point to a study that uses ROI and shows replication by comparing data to data, you would have a point.
    Without such study, you don’t.

    14 Jul 2016, 17:00

  27. Taylor

    Could you comment on the applicability of your findings to BrainVoyager’s Cluster-Level Statistical Threshold Estimator plugin?

    15 Jul 2016, 18:46

  28. Yolanda

    Cindy, I can see you didn’t like my comments but I didn’t get based on what you deemed them outrageous. Is it not true that a lot of MRI studies are conducted in the psychology departments, is it not true that most of the people in these departments don’t have expertise in mathematics, physics and computer science, is it not true that most of their doctoral students don’t have much technical knowledge about the algorithms for data processing and analysis and rarely understand the limitations of the techniques? They just learn to use the available software with the default settings and parameters. And the end goal is to publish a paper or two using neuroimaging so they can finish their thesis and get out of there, given that just behavioral data in healthy controls is not of interest for funding agencies. I’m talking about the US, but I doubt it is much different elsewhere, maybe just at a different scale.
    The number of MRI scanners per 1M population is 2-3 times above the clinical need. And research funding is critical for a lot of them.

    16 Jul 2016, 00:28

  29. Thomas Nichols

    Taylor: I’ve just quickly scanned the BrainVoyager documentation, and it appears that the Cluster-Level uses the Monte Carlo approach of Forman et al. (1995), i.e. the same as 3dClustSim. I would assume that it has similar issues at CDT P=0.01 (note that the 3dClustSim bug had relatively little contribution to the inflated FWE).

    16 Jul 2016, 09:12

  30. Kevin Black

    I love this quote: “Complete reporting of results, i.e. filing of statistical maps in public repositories, must happen! ... that we’ve gone 25 years showing only bitmap JPEG/TIFF’s of rendered images or tables of peaks is shameful. I’m currently in discussions with journal editors to press for sharing of full, unthresholded statistic maps to become standard in neuroimaging.” I quoted your similar “Post-It” comment in an editorial: http://f1000research.com/articles/3-272/v1 .

    I’d be glad to join your campaign talking with journal editors, if it will help.

    26 Jul 2016, 18:49
