Wednesday, October 7, 2015

Some practical considerations in arguing for the null

Learning about Bayes Factors can be a very liberating experience (as pointed out by Evie Vergauwe at the ESCOP session about Bayesian analyses and subsequently by Candice Morey in her blog). The main attraction of Bayes Factors is that you can use them to argue for the null hypothesis, which you can’t do with frequentist statistics. This means that you can go back to your file drawer, and draw some conclusions based on all of those non-significant results that have probably accumulated over the years (even if the conclusion might be that you can’t draw any conclusions given this data).
When I learned about Bayes Factors just before my thesis submission, I thought: “Sweet! Now I can interpret all of my non-significant results!” I reported Bayes Factors in addition to most of the frequentist analyses, which indeed allowed me to draw some conclusions that would otherwise have been impossible: I provided some evidence for null effects, and showed that for some unexpected statistically significant effects, the Bayes Factor provided equivocal evidence, suggesting that they should not be taken too seriously. Soon I realised, however, that there is more to arguing for a null hypothesis than just calculating a Bayes Factor.
Arguing for the null comes with some challenges that are less relevant if the experiment is simply designed to maximise the chances of getting a significant p-value. Therefore, it is not always meaningful to add Bayes Factors to an analysis ad hoc when the p-value did not come out as significant. When designing a study with the a priori intention of using Bayes Factor analyses, it is important to ask the question: “Does my design maximise my chances of drawing a meaningful conclusion, regardless of whether I get evidence for H1 or H0?” This is a shift in mindset from the traditional question, “Does my design maximise my chances of getting a significant p-value?” The latter question is, of course, problematic, because often, when such an experiment yields a non-significant p-value, it can only be discarded. This means that a carefully designed experiment addressing theoretically interesting and practically important questions may well turn into nothing more than a waste of the researcher’s resources and an impediment to scientific progress.
Here, I discuss some practical considerations that I came across in some attempts at arguing for the null. I hope this will be helpful to others who are figuring out how to get the most out of Bayes Factors. If anyone has any further suggestions, I would be very happy to get some feedback!

Sample sizes
Having a large sample size is always important, because small samples are prone to be influenced by extreme chance events. Bayes Factor analyses are somewhat less affected by this problem than frequentist statistics: a frequentist analysis will give you a p-value no matter what, while the Bayes Factor also tells you how confident you should be in the result. If the sample is small, Bayes Factor values are likely to hover around the value of 1, thus providing very little evidence about whether the data is more compatible with H0 or H1. Such an equivocal value tells you that you need more data if you want to draw any conclusions.
However, small-sample studies should always be taken with a grain of salt, as the law of large numbers still applies. If the effect is small, a genuine population group difference may yield near-identical means in a small sample. Similarly, a numerical difference between two conditions in the absence of a population effect is more likely in a small sample than in a large one. Therefore, bigger is always better when it comes to sample size.
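To illustrate, here is a rough sketch of the one-sample JZS Bayes Factor (the default in Rouder and colleagues’ approach) in Python. This is my own toy reimplementation for illustration, not the code behind any particular package, and the sample sizes are made up:

```python
import numpy as np
from scipy import integrate


def jzs_bf10(t, n, r=np.sqrt(2) / 2):
    """BF10 for a one-sample t-test, with a Cauchy(0, r) prior on effect size."""
    nu = n - 1

    def integrand(g):
        if g <= 0:
            return 0.0
        # Marginal likelihood of t under H1 for a given g, weighted by the
        # inverse-gamma(1/2, r^2/2) density that induces the Cauchy prior.
        lik = (1 + n * g) ** -0.5 * (1 + t ** 2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
        prior = r / np.sqrt(2 * np.pi) * g ** -1.5 * np.exp(-r ** 2 / (2 * g))
        return lik * prior

    marginal_h1, _ = integrate.quad(integrand, 0, np.inf)
    marginal_h0 = (1 + t ** 2 / nu) ** (-(nu + 1) / 2)
    return marginal_h1 / marginal_h0


# When the sample mean sits exactly on the null value (t = 0), a larger
# sample does not leave BF10 hovering near 1: it pushes BF10 towards 0,
# i.e. increasingly strong evidence for H0.
for n in (10, 50, 200):
    print(n, jzs_bf10(t=0.0, n=n))
```

In practice one would use an established implementation (such as the BayesFactor package mentioned in the comments below); the point here is only the qualitative behaviour.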

Think about the expected effect size
The key point behind Bayes Factor analyses is that they quantify the degree to which the data are consistent with the null hypothesis relative to a pre-specified alternative hypothesis. This is good, because it forces the researcher to be specific about the kind of effect she expects. As a drawback, it allows critics to argue that the evidence for the null is due to the use of an unrealistically wide prior for the H1 effect size, and that with a smaller prior the Bayes Factor would probably provide evidence for H1. This issue makes it theoretically impossible to be 100% confident that there is no effect. However, some consideration of the minimum effect size that would be of interest allows the researcher to use this information in the construction of the H1 prior, and to argue that if an effect exists, it is likely to be even smaller than this minimum.
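A toy example of this prior-sensitivity point, using a normal H1 prior with known sampling variance so that the Bayes Factor has a closed form (the default analyses use a Cauchy prior instead, and all numbers here are hypothetical):

```python
import math


def bf01_point_null(ybar, se, tau):
    """BF01 for H0: mu = 0 vs H1: mu ~ Normal(0, tau^2), given ybar ~ Normal(mu, se^2)."""
    def normal_pdf(x, var):
        return math.exp(-x ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    # Marginal likelihood of ybar under H0, divided by its marginal
    # likelihood under H1 (likelihood integrated over the Normal(0, tau^2) prior).
    return normal_pdf(ybar, se ** 2) / normal_pdf(ybar, se ** 2 + tau ** 2)


ybar, se = 0.1, math.sqrt(0.02)  # hypothetical sample mean and standard error
print(bf01_point_null(ybar, se, tau=1.0))  # wide H1 prior: stronger evidence for H0
print(bf01_point_null(ybar, se, tau=0.2))  # narrow H1 prior: weaker evidence for H0
```

The same data yield noticeably stronger evidence for the null under the wide prior, which is exactly the critic’s objection; specifying the minimum effect size of interest in advance is what justifies the choice of prior width.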

Avoid confounds, if possible!
Confounding variables are a bigger problem when arguing for H0 than when arguing for H1. For example, I would really like to know whether statistical learning ability is associated with the ease with which children learn print-to-speech correspondences (as measured, for example, by Grapheme-Phoneme Correspondence, or GPC knowledge). One way to go might be to recruit a large number of children, and measure them on an extensive battery of tests. One could even go all out and design a longitudinal study, to see if statistical learning ability in pre-school predicts GPC knowledge after the onset of reading instruction. We know from previous research that statistical learning ability is correlated with vocabulary knowledge and phonological awareness (Spencer, Kaschak, Jones, & Lonigan), which, in turn, are well-known correlates of reading ability and GPC knowledge. So we want to test vocabulary knowledge, phonological awareness, and word reading skills as well, to be sure that a correlation between statistical learning ability and GPC knowledge is not due to these confounding variables. We also want to test other potentially correlated participant-level factors, such as age, intelligence, attention, etc. We expect all of the variables that we measure to correlate with each other (because this is generally the case with developmental data), so the only meaningful result will be the partial regression coefficient of statistical learning ability on GPC knowledge.
If the partial regression coefficient is significant (or, even better, if the Bayes Factor provides evidence for a model including statistical learning ability as well as all the covariates compared to a model including the covariates only), I can conclude that statistical learning ability is correlated with GPC knowledge over and above the covariates.
If the regression coefficient is not significant, and even if we get evidence for the base model including the covariates only over the alternative model, drawing conclusions is trickier. It is possible that statistical learning ability does not have a direct effect on GPC knowledge, but that it affects vocabulary knowledge, which in turn affects phonological awareness, which in turn affects GPC knowledge. Perhaps one could even strengthen a case for such a potential causal pathway with one of ‘em fancy Structural Equation Models. Given the inter-correlated nature of all my independent variables, however, there would be a large number of possible mediators, and it is likely that even sophisticated statistical analyses cannot give me much useful information. For example, the relationship between statistical learning ability and GPC knowledge may disappear once we take into account phonological awareness. But then, the relationship between statistical learning ability and phonological awareness may also disappear after we take into account GPC knowledge. Thus, we won’t be able to conclude that one mediates the other, because both causal pathways are equally plausible – as are other causal pathways, such as the possibility that all three variables are affected by yet another confound, such as the child’s attentional capacity.
In short, such large-scale experiments with inter-correlated variables are an example of a design that does not maximise the researcher’s chances of being able to draw meaningful conclusions, regardless of the outcome. This is unfortunate, because such large-scale studies are often very time-consuming (for the researcher and participants) and expensive to run. They could still be useful for exploratory purposes, but they are not the best way to answer questions about the relationship between two variables.
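For what it’s worth, the model comparison described in this section could be sketched as follows, using the rough BIC approximation to the Bayes Factor rather than a full Bayesian regression; the data are simulated and all variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated, inter-correlated predictors: vocabulary, phonological
# awareness, and statistical learning (the latter with NO direct effect
# on the outcome in this simulation).
vocab = rng.normal(size=n)
phon_aware = 0.6 * vocab + rng.normal(size=n)
stat_learn = 0.5 * phon_aware + rng.normal(size=n)
gpc = 0.5 * vocab + 0.7 * phon_aware + rng.normal(size=n)  # outcome


def bic(y, predictors):
    """BIC of an OLS fit with an intercept (Gaussian likelihood, up to a constant)."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + X.shape[1] * np.log(len(y))


bic_base = bic(gpc, [vocab, phon_aware])
bic_full = bic(gpc, [vocab, phon_aware, stat_learn])
bf01 = np.exp((bic_full - bic_base) / 2)  # > 1 favours the covariates-only model
print(bf01)
```

The approximation exp((BIC_full − BIC_base) / 2) is crude, but it illustrates the logic of pitting a covariates-only model against one that adds statistical learning ability; with real data the base model would of course include the full set of covariates discussed above.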

Be as transparent as possible
To convince an audience of a null result[1], it might be especially important to be as transparent as possible: if the experiment is a failure-to-replicate, desperate authors of the original study may clutch at straws and accuse the replicator of creatively excluding outliers, choosing the wrong priors, or being generally incompetent. Such claims can be mostly counteracted by providing the raw data and the analysis scripts – and, even better, by pre-registering studies, and in an ideal-case scenario, getting the original authors’ approval of the experimental design and analysis plan before data is collected.
If the data and analysis scripts are available, and if original-authors-as-reviewers want to make claims about flaws in the analyses, they can (and should) show (1) how and why the replicator’s analyses are problematic, and (2) that there is evidence for their original effect (or equivocal evidence) once the analyses are done correctly.

In summary, I argue that it is important to design an experiment in a way that maximises the chance of being able to draw meaningful conclusions, regardless of whether H1 or H0 is supported. Reading through what I have written so far, it strikes me that all of the issues described above apply to all experimental designs, really. However, because some researchers continue to be more easily convinced by relatively shaky evidence for an H1 than by relatively strong evidence for H0,[2] it is especially important to maximise the strength of an experiment when there is a real possibility that H0 will be supported. I listed four considerations which might help a researcher to design a strong experiment: (1) large sample sizes, (2) a careful consideration of the expected effect size, (3) avoiding confounding variables, and (4) being as transparent as possible by making the raw data available and by providing full information about all analyses.

Spencer, M., Kaschak, M. P., Jones, J. L., & Lonigan, C. J. Statistical learning is related to early literacy-related skills. Reading and Writing, 1-24.

[1] Admittedly, I have not (yet) succeeded in doing this.
[2] Based on my subjective experience.


  1. Another thing you can do to avoid accusations of choosing the "wrong" prior is to re-analyze it yourself and explicitly show that the evidence isn't qualitatively changed by using other reasonable priors.

  2. Thanks, Alex, that's a great suggestion!
    I actually have a question about this - maybe you know the answer or could direct me to a relevant source:
    I have repeated the Bayes Factor analyses for a study where the default parameters provided evidence for the null, using smaller priors ("rscaleFixed = 0.1"). This has increased the error margin to +/-100%, making the results uninterpretable. Is there a way to change the prior that has less of an effect on the error margins?
    As an alternative approach, a statistician once told me that one can do the analyses both with a large and with a small prior, and then compare the posteriors of the critical effect: if H0 is true, both values should be about equally close to zero. Would you know of any recommendations for drawing conclusions about whether or not the posterior estimates are similar to each other, or whether this is something that’s up to the researcher’s judgement?
    Any advice would be appreciated!

    1. "Is there a way to change the prior that has less of an effect on the error margins?" Probably not, but I bet you'd be ok with more sensitive data! Maybe ask Morey about it, he'd know more about it than me.

      I wouldn't worry about the posteriors matching. Cauchy/g priors are designed to represent the information value of very few observations (they are t distributions with ~1 df), so with moderate amounts of data they wash out fast. Models with fat-tailed priors will generally do that. If you're using more informative priors, then naturally they will converge at a rate related to their relative informativeness and the information contained in the data.

      But in general I don't recommend looking at posteriors to judge support for H0. Too ad hoc and vague. The only principled way to do it is with a Bayes factor :)
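The “washing out” point from the thread above can be sketched with conjugate normal priors (chosen purely because the posterior mean then has a closed form; the discussion concerns Cauchy/g priors, and all numbers below are made up):

```python
def posterior_mean(ybar, n, sigma2, tau2, prior_mean=0.0):
    """Posterior mean of mu for y_i ~ Normal(mu, sigma2), mu ~ Normal(prior_mean, tau2)."""
    precision_data = n / sigma2
    precision_prior = 1.0 / tau2
    # The posterior mean is a precision-weighted average of the sample
    # mean and the prior mean.
    w = precision_data / (precision_data + precision_prior)
    return w * ybar + (1 - w) * prior_mean


# With moderate n, wide and narrow priors already give similar posterior
# means; with large n they are nearly indistinguishable.
for n in (10, 100, 1000):
    wide = posterior_mean(ybar=0.3, n=n, sigma2=1.0, tau2=1.0)
    narrow = posterior_mean(ybar=0.3, n=n, sigma2=1.0, tau2=0.05)
    print(n, wide, narrow)
```

As n grows, both posterior means converge on the sample mean, so with enough data the choice between these two priors barely matters for the estimate.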