I recently came across an exchange in *Psychological Science* that perfectly illustrates some of the problems involved with the use of Bayes factors. Scheibehenne, Jamil, and Wagenmakers (2016) meta-analyze the probability of hotel towel reuse in two conditions, and compare two models of the log odds using a Bayes factor. They even conduct sequential testing of the Bayes factor, claiming “…the Bayesian approach relies on the data that were actually observed and allows evidence to be seamlessly updated after every new study”, which results in hugely inflated error rates, as I’ve pointed out before.

Carlsson, Schimmack, Williams, and Bürkner (2017) argue in favor of a hierarchical modeling approach, returning a full posterior distribution over the parameters (the only truly Bayesian way to do things). Really, the heart of the disagreement is a fixed- vs. random-effects approach to modeling the data, and Scheibehenne, Gronau, Jamil, and Wagenmakers (2017) respond with a detailed model comparison using both the Bayes factor and Bayesian model averaging. The results seem to support the choice of the fixed-effects model, but inadvertently end up showing why Bayes factors are almost impossible to use and interpret in practice.

For the $K$ studies in the meta-analysis, the authors observe effect sizes $y_i$ with associated variances $\sigma_i^2$. Under the fixed-effects model, each observation is normally distributed with common mean $\mu$, where $\mu$ has a zero-mean normal prior with variance $\sigma_\mu^2$. Formally:

$$y_i \sim \mathcal{N}(\mu, \sigma_i^2), \qquad \mu \sim \mathcal{N}(0, \sigma_\mu^2).$$

In the random-effects model, we assume that the observations estimate true effects $\theta_i$, which are normally distributed about $\mu$ with variance $\tau^2$, where $\tau$ is given a half-Cauchy prior with scale parameter $\gamma$, so that

$$y_i \sim \mathcal{N}(\theta_i, \sigma_i^2), \qquad \theta_i \sim \mathcal{N}(\mu, \tau^2), \qquad \mu \sim \mathcal{N}(0, \sigma_\mu^2), \qquad \tau \sim \text{Half-Cauchy}(\gamma).$$

If we want, we can even marginalize over the latent random effects $\theta_i$ for simplicity, giving

$$y_i \sim \mathcal{N}(\mu, \sigma_i^2 + \tau^2),$$

which is the way non-Bayesian random effects meta-analyses are usually handled in the literature. Intuitively, both of these models seem perfectly reasonable, and we would expect both of them to be well behaved if we were only interested in estimating posterior distributions.

The Bayes factor for the fixed-effects model $\mathcal{M}_F$ over the random-effects model $\mathcal{M}_R$ is

$$\mathrm{BF} = \frac{p(y \mid \mathcal{M}_F)}{p(y \mid \mathcal{M}_R)} = \frac{\int p(y \mid \theta_F)\, p(\theta_F)\, d\theta_F}{\int p(y \mid \theta_R)\, p(\theta_R)\, d\theta_R},$$

where $\theta_F = \mu$ and $\theta_R = (\mu, \tau)$ are the parameters of the fixed- and random-effects models, respectively.
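Both marginal likelihoods are low-dimensional integrals, so they can be checked directly by brute-force quadrature. Below is a minimal pure-Python sketch; the effect sizes, variances, prior scale on $\mu$, and grid limits are all made-up values for illustration, not the towel-reuse data.

```python
import math

# Hypothetical effect sizes (log odds ratios) and sampling variances --
# made-up numbers for illustration, not the data from the meta-analysis.
y    = [0.35, 0.10, 0.25, 0.45, 0.05]
sig2 = [0.04, 0.09, 0.05, 0.08, 0.10]
SIGMA_MU = 2.0   # assumed prior sd of the common mean mu

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def half_cauchy_pdf(tau, gamma):
    return 2.0 / (math.pi * gamma * (1.0 + (tau / gamma) ** 2))

def fixed_marginal():
    """p(y | M_F): integrate the fixed-effects likelihood against the prior on mu."""
    lo, hi, n = -3.0, 3.0, 1201
    h = (hi - lo) / (n - 1)
    total = 0.0
    for j in range(n):
        mu = lo + j * h
        f = norm_pdf(mu, 0.0, SIGMA_MU ** 2)
        for yi, vi in zip(y, sig2):
            f *= norm_pdf(yi, mu, vi)
        total += f * (0.5 if j in (0, n - 1) else 1.0)
    return total * h

def random_marginal(gamma):
    """p(y | M_R): with the theta_i marginalized out, y_i ~ N(mu, sig2_i + tau^2);
    integrate over mu and tau on a 2-D trapezoid grid (tau truncated where the
    likelihood is negligible)."""
    mu_lo, mu_hi, nm = -3.0, 3.0, 201
    t_lo, t_hi, nt = 0.0, 5.0, 201
    hm, ht = (mu_hi - mu_lo) / (nm - 1), (t_hi - t_lo) / (nt - 1)
    total = 0.0
    for j in range(nm):
        mu = mu_lo + j * hm
        wj = 0.5 if j in (0, nm - 1) else 1.0
        base = norm_pdf(mu, 0.0, SIGMA_MU ** 2)
        for k in range(nt):
            tau = t_lo + k * ht
            wk = 0.5 if k in (0, nt - 1) else 1.0
            f = base * half_cauchy_pdf(tau, gamma)
            for yi, vi in zip(y, sig2):
                f *= norm_pdf(yi, mu, vi + tau ** 2)
            total += wj * wk * f
    return total * hm * ht

bf_fr = fixed_marginal() / random_marginal(1.0)
print(f"BF (fixed over random, gamma = 1): {bf_fr:.2f}")
```

The same machinery works for any choice of data and priors; the point is only that each marginal likelihood is just an average of the likelihood over the prior, with nothing hidden.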

The Bayes factor seems to offer strong support for the fixed-effects model, as you can see in the figure below (a replica of the figure from Scheibehenne et al.).

The support for the fixed-effects model is particularly strong when the prior scale $\gamma$ is very large, and the authors conclude “random-effects models with diffuse priors on $\tau$ may yield different results from the fixed-effect model, but these random-effects models are overly complex and do not generalize well. The simple fixed-effect model predicts the observed data best.” Note that this does not follow from any of their results — the Bayes factor quantifies neither predictive accuracy nor generalizability. But the much more serious problem is in the Bayes factor itself. Look at what happens when we use even more diffuse priors:

The Bayes factor becomes linear in $\gamma$! But why? Well, let’s take a closer look.

For fixed data, the numerator of the Bayes factor is constant in the choice of prior scale $\gamma$, so we will concern ourselves with the denominator. Fortunately, it is not necessary to fully evaluate the integral in order to assess the impact of the choice of $\gamma$. Letting $D$ denote the denominator and writing out the half-Cauchy density, we have

$$D = \int_0^\infty \int p(y \mid \mu, \tau)\, p(\mu)\, \frac{2}{\pi \gamma \left(1 + (\tau/\gamma)^2\right)}\, d\mu\, d\tau = \frac{2}{\pi \gamma} \int_0^\infty \int p(y \mid \mu, \tau)\, p(\mu)\, \frac{1}{1 + (\tau/\gamma)^2}\, d\mu\, d\tau.$$

So the prior scale $\gamma$ divides the denominator, and hence directly multiplies the Bayes factor. In fact, when $\gamma$ is much larger than any plausible value of $\tau$, the rightmost fraction $1/(1 + (\tau/\gamma)^2)$ is essentially constant over most of the mass of the integrand, and so the entire Bayes factor becomes a linear function of $\gamma$. This means that, when $\gamma$ is very large, the Bayes factor will always favor the fixed-effects model, *regardless of the data*!
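This scaling is easy to confirm numerically. The sketch below (pure Python, with made-up effect sizes and variances rather than the actual data) evaluates the denominator $D$ on a grid for two large values of the prior scale and checks that multiplying $\gamma$ by ten multiplies the Bayes factor by ten.

```python
import math

# Made-up effect sizes and variances for illustration only.
y    = [0.35, 0.10, 0.25, 0.45, 0.05]
sig2 = [0.04, 0.09, 0.05, 0.08, 0.10]
SIGMA_MU = 2.0  # assumed prior sd on mu

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def denominator(gamma):
    """p(y | M_R) under a half-Cauchy(gamma) prior on tau, by 2-D quadrature."""
    mu_lo, mu_hi, nm = -3.0, 3.0, 201
    t_lo, t_hi, nt = 0.0, 5.0, 201
    hm, ht = (mu_hi - mu_lo) / (nm - 1), (t_hi - t_lo) / (nt - 1)
    total = 0.0
    for j in range(nm):
        mu = mu_lo + j * hm
        wj = 0.5 if j in (0, nm - 1) else 1.0
        base = norm_pdf(mu, 0.0, SIGMA_MU ** 2)
        for k in range(nt):
            tau = t_lo + k * ht
            wk = 0.5 if k in (0, nt - 1) else 1.0
            prior_tau = 2.0 / (math.pi * gamma * (1.0 + (tau / gamma) ** 2))
            f = base * prior_tau
            for yi, vi in zip(y, sig2):
                f *= norm_pdf(yi, mu, vi + tau ** 2)
            total += wj * wk * f
    return total * hm * ht

# The numerator p(y | M_F) does not depend on gamma, so the ratio of Bayes
# factors at two prior scales is just the inverse ratio of denominators.
ratio = denominator(100.0) / denominator(1000.0)
print(f"BF(gamma=1000) / BF(gamma=100) = {ratio:.3f}")  # close to 10
```

Nothing about the data enters this conclusion: once $\gamma$ dwarfs the plausible heterogeneity, scaling the prior simply rescales the Bayes factor.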

In the fully Bayesian procedure advocated by Carlsson et al., any prior which is flat over a reasonable range of parameter values would result in essentially the same inference — i.e., large values of $\gamma$ result in a noninformative prior. In contrast, large values of $\gamma$ are *highly* informative in the Bayes factor used by Scheibehenne et al. This kind of unintuitive prior dependence is well documented in the literature (e.g., Johnson and Rossell, 2010). Note that Bayesian model averaging does not solve this problem, since the posterior model probabilities are functions of the Bayes factor (Clyde and George, 2004).

In a fully Bayesian model, the researcher can safely choose $\gamma$ based on their intuition about the true heterogeneity in their data, knowing that minor differences in their choice are unlikely to affect their inference. Similarly, a researcher who wishes to remain agnostic can simply choose an arbitrarily large value of $\gamma$, knowing that any sufficiently large choice will act as a noninformative prior. Using the Bayes factor, the author can *never* be agnostic, as $\gamma$ is always informative. Worse, the way in which inference depends on $\gamma$ is not obvious, since it specifies not only a reasonable range of heterogeneity, but also a proportionality constant for the Bayes factor. How, then, is the researcher to choose a value? And how is the researcher to interpret the result?
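The contrast is easy to demonstrate numerically. In the sketch below (again pure Python with made-up data), increasing $\gamma$ a thousandfold leaves the posterior mean of $\mu$ under the random-effects model essentially unchanged, while the marginal likelihood, and hence any Bayes factor built from it, shifts by orders of magnitude.

```python
import math

# Made-up effect sizes and variances (not the towel-reuse data).
y    = [0.35, 0.10, 0.25, 0.45, 0.05]
sig2 = [0.04, 0.09, 0.05, 0.08, 0.10]
SIGMA_MU = 2.0  # assumed prior sd on mu

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_summary(gamma):
    """Return (marginal likelihood, posterior mean of mu) for the
    random-effects model under a half-Cauchy(gamma) prior on tau."""
    mu_lo, mu_hi, nm = -3.0, 3.0, 201
    t_lo, t_hi, nt = 0.0, 5.0, 201
    hm, ht = (mu_hi - mu_lo) / (nm - 1), (t_hi - t_lo) / (nt - 1)
    z = 0.0        # marginal likelihood p(y | M_R)
    mu_sum = 0.0   # integral of mu times the joint density
    for j in range(nm):
        mu = mu_lo + j * hm
        wj = 0.5 if j in (0, nm - 1) else 1.0
        base = norm_pdf(mu, 0.0, SIGMA_MU ** 2)
        for k in range(nt):
            tau = t_lo + k * ht
            wk = 0.5 if k in (0, nt - 1) else 1.0
            prior_tau = 2.0 / (math.pi * gamma * (1.0 + (tau / gamma) ** 2))
            f = base * prior_tau
            for yi, vi in zip(y, sig2):
                f *= norm_pdf(yi, mu, vi + tau ** 2)
            z += wj * wk * f
            mu_sum += wj * wk * mu * f
    z *= hm * ht
    mu_sum *= hm * ht
    return z, mu_sum / z

z_small, mean_small = posterior_summary(1.0)
z_large, mean_large = posterior_summary(1000.0)

# The posterior mean of mu barely moves...
print(f"posterior mean of mu: {mean_small:.3f} vs {mean_large:.3f}")
# ...while the marginal likelihood (and hence any Bayes factor against a
# gamma-free numerator) shifts by orders of magnitude.
print(f"marginal likelihood ratio: {z_small / z_large:.1f}")
```

This is the asymmetry in a nutshell: the same change of prior scale that is innocuous for estimation silently rewrites the evidence reported by the Bayes factor.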

Defenders of the Bayes factor would probably argue that its exact value is unimportant in this case — the fact that it consistently favors the fixed-effects model across a range of priors is good enough. But aren’t Bayes factors supposed to quantify evidence? How can they do this if their specific value can be ignored, or is uninterpretable? Most importantly, a researcher who uses a Bayes factor to compare meta-analytic models probably *doesn’t know* that specifying a noninformative prior over the parameters of $\mathcal{M}_R$ automatically increases support for $\mathcal{M}_F$. But these things can easily happen without a complete understanding of the way the priors and models interact. In general, unless you’re willing to write out the marginal density yourself and see exactly what the prior is doing, it’s very easy for a Bayes factor to do something weird. This makes Bayes factors essentially worthless for scientists without either a strong background in mathematics and statistics, or a statistician on hand to do the checking for them. **You cannot simply come up with two models and compare them using a Bayes factor**. It won’t work. Your Bayes factor will, in all likelihood, be doing something you don’t know and don’t want.

### References

Carlsson, R., Schimmack, U., Williams, D. R., & Bürkner, P. C. (2017). Bayes factors from pooled data are no substitute for Bayesian meta-analysis: Commentary on Scheibehenne, Jamil, and Wagenmakers (2016). *Psychological Science*, 0956797616684682.

Clyde, M., & George, E. I. (2004). Model uncertainty. *Statistical Science*, 19(1), 81–94.

Johnson, V. E., & Rossell, D. (2010). On the use of non‐local prior densities in Bayesian hypothesis tests. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, 72(2), 143-170.

Scheibehenne, B., Jamil, T., & Wagenmakers, E. J. (2016a). Bayesian evidence synthesis can reconcile seemingly inconsistent results: The case of hotel towel reuse. *Psychological Science*, 27(7), 1043-1046.

Scheibehenne, B., Gronau, Q. F., Jamil, T., & Wagenmakers, E. J. (2017). Fixed or random? A resolution through model averaging: Reply to Carlsson, Schimmack, Williams, and Bürkner (2017). *Psychological Science*, 0956797617724426.