The 20% Statistician

A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, July 18, 2016

Dance of the Bayes factors


You might have seen the ‘Dance of the p-values’ video by Geoff Cumming (if not, watch it here). Because p-values and the default Bayes factors (Rouder, Speckman, Sun, Morey, & Iverson, 2009) are both calculated directly from t-values and sample sizes, we might expect there is also a Dance of the Bayes factors. And indeed, there is. Bayes factors can vary widely over identical studies, just due to random variation.

If people always correctly interpreted Bayes factors, that would not be a problem. Bayes factors tell you how well the data are in line with models, and quantify the relative evidence in favor of one of these models. The data are what they are, even when they are misleading (i.e., supporting a hypothesis that is not true). So you can conclude the null model is more likely than some other model, but purely based on a Bayes factor you can’t draw a conclusion such as “This Bayes factor allows us to conclude that there are no differences between conditions”. Regrettably, researchers are starting to misinterpret Bayes factors on a massive scale (I won't provide references, though I have many). This is not surprising – people find statistical inferences difficult, whether these are about p-values, confidence intervals, or Bayes factors.

As a consequence, we see many dichotomous absolute interpretations (“we conclude there is no effect”) instead of continuous relative interpretations (“we conclude the data increase our belief in the null model compared to the alternative model”). As a side note: in my experience, some people who advocate Bayesian statistics over NHST live in a weird limbo. When they criticize Null-Hypothesis Significance Testing, they call it a useless procedure because we already know the null is never true, but they love using Bayes factors to conclude the null hypothesis is supported.

For me, there is one important difference between the dance of the p-values and the dance of the Bayes factors: when people draw dichotomous conclusions, p-values allow you to control your error rate in the long run, while error rates are ignored when people use Bayes factors. As a consequence, you can easily conclude there is ‘no effect’ when there is an effect, 25% of the time (see below). This is a direct consequence of the ‘Dance of the Bayes factors’.

Let’s take the following scenario: There is a true small effect, Cohen’s d = 0.3. You collect data and perform a default two-sided Bayesian t-test with 75 participants in each condition. Let’s repeat this 100,000 times, and plot the Bayes factors we can expect.
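To get a feel for the mechanics, here is a minimal sketch of such a simulation (an illustration, not the full ‘Dance of the Bayes factors’ script at the bottom of this post). It assumes the BayesFactor package, which computes the default Bayes factor directly from the t-value and the sample sizes; the exact proportions you get depend on the prior scale set via rscale.

# Minimal sketch (not the full script below): simulate default Bayes factors
# for a true effect of d = 0.3 with 75 participants per condition.
library(BayesFactor)

set.seed(42)
nsim <- 1e4                              # increase (e.g., to 1e5) for a smoother picture
bf <- replicate(nsim, {
  x <- rnorm(75, mean = 0.0, sd = 1)     # control condition
  y <- rnorm(75, mean = 0.3, sd = 1)     # experimental condition, true d = 0.3
  tval <- t.test(y, x, var.equal = TRUE)$statistic
  exp(ttest.tstat(t = tval, n1 = 75, n2 = 75, rscale = sqrt(2)/2)$bf)
})

hist(log10(bf), breaks = 100, xlab = "log10(BF10)",
     main = "Dance of the Bayes factors")
mean(bf < 1/3)              # proportion interpreted as support for the null
mean(bf > 3)                # proportion interpreted as support for the alternative
mean(bf >= 1/3 & bf <= 3)   # proportion inconclusive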



If you like a more dynamic version, check the ‘Dance of the Bayes factors’ R script at the bottom of this post. As output, it gives you a :D smiley when you have strong evidence for the null (BF < 0.1), a :) smiley when you have moderate evidence for the null, a (._.) when the data are inconclusive, and a :( or :(( when the data strongly support the alternative (smileys are coded based on the assumption that researchers want to find support for the null). See the .gif below for the Dance of the Bayes factors if you don’t want to run the script.

I did not choose this example randomly (just as Geoff Cumming did not randomly choose to use 50% statistical power in his ‘Dance of the p-values’ video). In this situation, approximately 25% of Bayes factors are smaller than 1/3 (which can be interpreted as support for the null), 25% are larger than 3 (which can be interpreted as support for the alternative), and 50% are inconclusive. If you conclude, based on your Bayes factor, that there are no differences between groups, you’d be wrong 25% of the time, in the long run. That’s a lot.

(You might feel more comfortable using a BF of 1/10 as a ‘strong evidence’ threshold: BF < 0.1 happens 12.5% of the time in this simulation. A BF > 10 never happens: we don't have a large enough sample size. If your true effect size is 0.3, you have decided to collect a maximum of 75 participants in each group, and you look at the data repeatedly until you have ‘strong evidence’ (BF > 10 or BF < 0.1), you will never observe support for the alternative: you can only observe strong evidence in favor of the null model, even though there is a true effect.)
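To see what such repeated looking does in practice, here is a minimal sketch of this kind of optional stopping (again an illustration assuming the BayesFactor package, not the script at the bottom of this post): peek at the Bayes factor every 5 participants per group, and stop at ‘strong evidence’ or at the maximum sample size.

# Minimal sketch: optional stopping on a default Bayes factor, peeking every
# 5 participants per group up to the maximum of 75 per group.
library(BayesFactor)

set.seed(1)
x <- rnorm(75, mean = 0.0, sd = 1)     # control condition
y <- rnorm(75, mean = 0.3, sd = 1)     # experimental condition, true d = 0.3
for (n in seq(10, 75, by = 5)) {
  tval <- t.test(y[1:n], x[1:n], var.equal = TRUE)$statistic
  bf10 <- exp(ttest.tstat(t = tval, n1 = n, n2 = n, rscale = sqrt(2)/2)$bf)
  cat(sprintf("n per group = %2d, BF10 = %6.2f\n", n, bf10))
  if (bf10 > 10 || bf10 < 1/10) break  # stop at the 'strong evidence' threshold
}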

Felix Schönbrodt gives some examples of the probability that you will observe a misleading Bayes factor for different effect sizes and priors (Schönbrodt, Wagenmakers, Zehetleitner, & Perugini, 2015). Here, I just want to note that you might want to take the Frequentist properties of Bayes factors into account if you want to draw dichotomous conclusions such as ‘the data allow us to conclude there is no effect’. Just as the ‘Dance of the p-values’ can be turned into a ‘March of the p-values’ by increasing the statistical power, you can design studies that will yield informative Bayes factors most of the time (Schönbrodt & Wagenmakers, 2016). But you can only design informative studies, in the long run, if you take the Frequentist properties of tests into account. If you just look ‘at the data at hand’, your Bayes factors might be dancing around. You need to look at their Frequentist properties to design studies where Bayes factors march around. My main point in this blog is that this is something you might want to do.

What’s the alternative? First, never draw incorrect dichotomous conclusions based on Bayes factors. I have the feeling I will be repeating this for the next 50 years. Bayes factors are relative evidence. If you want to make statements about how likely the null is, define a range of possible priors, use Bayes factors to update these priors, and report posterior probabilities as your explicit subjective belief in the null.

Second, you might want to stay away from the default priors. Using default priors as a Bayesian is like eating a no-fat no-sugar no-salt chocolate-chip cookie: You might as well skip it. You will just get looks of sympathy as you try to swallow it down. Look at Jeff Rouder’s post on how to roll your own priors.

Third, if you just want to say the effect is smaller than anything you find worthwhile (without specifically concluding there is no effect), equivalence testing might be much more straightforward. It has error control, so in the long run you won’t too often incorrectly say the effect is smaller than anything you care about.

The final alternative is just to ignore error rates. State loudly and clearly that you don’t care about Frequentist properties. Personally, I hope Bayesians will not choose this option. I would not be happy with a literature where thousands of articles claim the null is true, when there is a true effect. And you might want to know how to design studies that are likely to give answers you find informative.

When using Bayes factors, remember they can vary a lot across identical studies. Also remember that Bayes factors give you relative evidence. The null model might be more likely than the alternative, but both models can be wrong. If the true effect size is 0.3, the data might be closer to a value of 0 than to a value of 0.7, but it does not mean the true value is 0. In Bayesian statistics, the same reasoning holds. Your data may be more likely under a null model than under an alternative model, but that does not mean there are no differences. If you nevertheless want to argue that the null-hypothesis is true based on just a Bayes factor, realize you might be fooling yourself 25% of the time. Or more. Or less.


References

  • Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. http://doi.org/10.3758/PBR.16.2.225
  • Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2015). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/26651986
  • Schönbrodt, F. D., & Wagenmakers, E.-J. (2016). Bayes factor design analysis: Planning for compelling evidence (SSRN Scholarly Paper No. ID 2722435). Rochester, NY: Social Science Research Network. Retrieved from http://papers.ssrn.com/abstract=2722435



Thursday, June 23, 2016

NWO pilot project will fund €3,000,000 worth of replication research


The Times Higher Education reports on two new initiatives in the Netherlands to bolster scientific standards. Here, I want to talk about one of these initiatives I was involved in: A fund for replication research. The board of the Dutch science funder NWO still has to officially approve the final call for proposals, but the draft text is basically done.

Now I’m very proud of the Dutch NWO for taking such a bold step and funding replication research. I’m also proud of the small role I played in this process. And because I think it’s a nice story of how you can sometimes instigate some change as a single individual, I thought I’d share it.

In 2012, Sander Koole and I published a paper on the importance of rewarding replication research. We decided we would not just write about this topic, but act. And thus we wrote to the Dutch science funder NWO. We explained how performing replication research is like eating your veggies (Brussels sprouts, to be precise): Very few people get enthusiastic about vegetables, but if you don’t eat them you’ll never grow big and strong. Because researchers have spent all their money on sweets, science now has cavities. We need to stimulate healthier behavior. And thus NWO should fund replication research. I love how we recommended that NWO should also include money for an open access publication, which ‘is currently the only place where replication studies are published’. How long ago can 2012 feel?

The response of NWO was classic. They wrote: “As long as a proposal contains innovative elements, and the results can contribute to the development of science, it fits within NWO calls. If the test of earlier results would for example be done through a new method, the proposal can successfully take part in the NWO competition.” (Our piece, and the reply by NWO, are available below, if you read Dutch).

We can see they did not yet get the point (indeed, the replication grants that are now introduced are not available for conceptual replications, only for replication studies that use the same method).

I was a bit annoyed, to put it politely. And here’s an important life lesson I took away from this: When you are truly frustrated, don’t hold it in. I sent NWO an email. Let me quote myself: “Replication is the very foundation of a robust science. The current NWO policy undermines the foundation of science. If this is not dealt with, NWO will be doing more bad than good for science. There is no other solution than to adjust the current policy.”

Now, I had completed my PhD only two years earlier, and I fully expected NWO to ignore my opinion. But remarkably, they didn’t. Instead, they invited me over for a talk. A very nice talk. I explained the problems, and reminded NWO that they legally have two tasks: Stimulate novel research, and improve the quality of research. This second task, I argued, could use some more attention. I proposed a possible solution (the current grants will be much larger than what I initially suggested), and everyone became enthusiastic about the idea of funding replication research.

Change takes time, but here we are, some four years later, with €3,000,000 for replication research (to be spread out over three years). Many people at NWO have worked very hard on making this possible, and I’m grateful for all their work. It’s been fun to be at the start of something as exciting as this bold step the Dutch science funder NWO is taking. I look forward to the cool projects researchers will do with these grants.


Friday, May 20, 2016

Absence of evidence is not evidence of absence: Testing for equivalence

When you find p > 0.05, you did not observe surprising data, assuming there is no true effect. In the literature, p > 0.05 is often interpreted as ‘no effect’, but due to a lack of power the data might not have been surprising even if there were an effect. In this blog I’ll explain how to test for equivalence, or the lack of a meaningful effect, using equivalence hypothesis testing. I’ve created easy-to-use R functions that allow you to perform equivalence hypothesis tests. Warning: If you read beyond this paragraph, you will never again be able to write “as predicted, the interaction revealed there was an effect for participants in the experimental condition (p < 0.05) but there was no effect in the control condition (F < 1).” If you prefer the veil of ignorance, here’s a nice site with cute baby animals to spend the next 9 minutes on instead.

Any science that wants to be taken seriously needs to be able to provide support for the null hypothesis. I often see people switch from Frequentist statistics when effects are significant to Bayes factors when they want to provide support for the null hypothesis. But it is possible to test for the lack of an effect using p-values (why no one ever told me this in the 11 years I worked in science is beyond me). It’s as easy as doing a t-test, or more precisely, as doing two t-tests.

The practice of Equivalence Hypothesis Testing (EHT) is used in medicine, for example to test whether a new cheaper drug isn’t worse (or better) than the existing more expensive option. A very simple EHT approach is the ‘two-one-sided t-tests’ (TOST) procedure (Schuirmann, 1987). Its simplicity makes it wonderfully easy to use.

The basic idea of the test is to flip things around: In Equivalence Hypothesis Testing, the null hypothesis is that there is a true effect larger than a Smallest Effect Size of Interest (SESOI; Lakens, 2014). That’s right – the null hypothesis is now that there IS an effect, and we are going to try to reject it (with p < 0.05). The alternative hypothesis is that the effect is smaller than the SESOI, anywhere in the equivalence range – any effect you think is too small to matter, or too small to feasibly examine. For example, a Cohen’s d of 0.5 is a medium effect, so you might set d = 0.5 as your SESOI, and the equivalence range then goes from d = -0.5 to d = 0.5. In the TOST procedure, you first decide upon your SESOI: anything smaller than your smallest effect size of interest is considered smaller than small, and will allow you to reject the null hypothesis that there is a true effect. You then perform two t-tests, one testing whether the effect is smaller than the upper bound of the equivalence range, and one testing whether the effect is larger than the lower bound of the equivalence range. Yes, it’s that simple.

Let’s visualize this. Below on the left axis is a scale for the effect size measure Cohen’s d. On the left is a single 90% confidence interval (the crossed circles indicate the endpoints of the 95% confidence interval) with an effect size of d = 0.13. On the right is the equivalence range. It is centered on 0, and ranges from d = -0.5 to d = 0.5.

We see from the 95% confidence interval around d = 0.13 (again, the endpoints of which are indicated by the crossed circles) that the lower bound overlaps with 0. This means the effect (d = 0.13, from an independent t-test with two conditions of 90 participants each) is not statistically different from 0 at an alpha of 5%, and the p-value of the normal NHST is 0.384 (the title provides the exact numbers for the 95% CI around the effect size). But is this effect statistically smaller than my smallest effect size of interest?

Rejecting the presence of a meaningful effect

There are two ways to test for the lack of a meaningful effect, and they yield the same result. The first is to perform two one-sided t-tests, testing the observed effect size against the ‘null’ of your SESOI (0.5 and -0.5). These t-tests show that d = 0.13 is significantly larger than d = -0.5, and significantly smaller than d = 0.5. The higher of these two p-values is p = 0.007. We can conclude that there is support for the lack of a meaningful effect (where ‘meaningful’ is defined by your choice of the SESOI). The second approach (which is easier to visualize) is to calculate a 90% CI around the effect (indicated by the vertical line in the figure), and check whether this 90% CI falls completely within the equivalence range. You can see a line from the upper and lower limits of the 90% CI around d = 0.13 on the left to the equivalence range on the right, and the 90% CI is completely contained within the equivalence range. This means we can reject the null hypothesis of an effect larger than d = 0.5 or smaller than d = -0.5, and conclude the effect is smaller than what we find meaningful (and you’ll be right 95% of the time, in the long run).
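If you want to check these numbers yourself, here is a minimal sketch of the calculation (a simple z-approximation based on Cohen’s d and the sample sizes; it is not one of the TOST functions introduced below, and it ignores the small-sample correction a t-based test would add):

# Minimal sketch (z-approximation): two one-sided tests of an observed Cohen's d
# against the equivalence bounds, plus the 90% CI, for d = 0.13 with n = 90 per condition.
tost_d_sketch <- function(d, n1, n2, eqbound_d, alpha = 0.05) {
  se    <- sqrt(1/n1 + 1/n2)                  # approximate standard error of d
  p_low <- 1 - pnorm((d + eqbound_d) / se)    # H0: effect <= -eqbound_d
  p_up  <- pnorm((d - eqbound_d) / se)        # H0: effect >= +eqbound_d
  ci90  <- d + c(-1, 1) * qnorm(1 - alpha) * se
  list(p_tost = max(p_low, p_up), ci90 = ci90)
}

tost_d_sketch(d = 0.13, n1 = 90, n2 = 90, eqbound_d = 0.5)
# The larger of the two one-sided p-values is about 0.007, and the 90% CI
# (roughly -0.12 to 0.38) falls entirely inside the equivalence range.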

[Technical note: The reason we use a 90% confidence interval, and not a 95% confidence interval, is that the two one-sided tests are completely dependent. You could actually just perform one test: if the effect size is positive, against the upper bound of the equivalence range, and if the effect size is negative, against the lower bound of the equivalence range. If this one test is significant, so is the other. Therefore, we can use a 90% confidence interval, even though we perform two one-sided tests. This is also why the crossed circles are used to mark the 95% CI.]

So why haven’t we been using these tests in the psychological literature? It’s the same old, same old. Your statistics teacher didn’t tell you about them. SPSS doesn’t allow you to do an equivalence test. Your editors and reviewers were always OK with statements such as “as predicted, the interaction revealed there was an effect for participants in the experimental condition (p < 0.05) but there was no effect in the control condition (F < 1).” Well, I just ruined that for you. Absence of evidence is not evidence of absence!

We can’t use p > 0.05 as evidence for the lack of an effect. You can switch to Bayesian statistics if you want to support the null, but the default priors are only useful in research areas where very large effects are examined (e.g., some areas of cognitive psychology), and are not appropriate for most other areas in psychology, so you will have to be able to quantify your prior belief yourself. You can teach yourself how, but there might be researchers who prefer to provide support for the lack of an effect within a Frequentist framework. Given that most people think about the effect size they expect when designing their study, defining the SESOI at that moment is straightforward. After choosing the SESOI, you can even design your study to have sufficient power to reject the presence of a meaningful effect. Controlling your error rates is thus straightforward in equivalence hypothesis tests, while it is not that easy in Bayesian statistics (although it can be done through simulations).

One thing I noticed while reading this literature is that TOST procedures, and power analyses for TOST, are not created to match the way psychologists design studies and think about meaningful effects. In medicine, equivalence is defined on the raw scale (e.g., a decrease of 10% compared to the default medicine), while we are more used to thinking in terms of standardized effect sizes (correlations or Cohen’s d). Biostatisticians are fine with estimating the pooled standard deviation for a future study when performing a power analysis for TOST, but psychologists use standardized effect sizes to perform power analyses. Furthermore, the packages that exist in R (e.g., equivalence) and the software that performs equivalence hypothesis tests (e.g., Minitab, which has TOST for t-tests, but not for correlations) require the raw data. In my experience (Lakens, 2013), researchers find it easier to use their own preferred software to handle their data, and then calculate additional statistics not provided by that software by typing summary statistics (means, standard deviations, and sample sizes per condition) into a spreadsheet. So my functions don’t require access to the raw data (which is good for reviewers as well). Finally, the functions make a nice picture such as the one above, so you can see what you are doing.

R Functions

I created R functions for TOST for independent t-tests, paired-samples t-tests, and correlations, where you can set the equivalence thresholds using Cohen’s d, Cohen’s dz, and r. I adapted the equation for the power analysis to be based on d, and I created the equation for the power analysis for a paired-samples t-test from scratch, because I couldn’t find it in the literature. If it is not obvious: None of this is peer-reviewed (yet), and you should use it at your own risk. I checked the independent and paired t-test formulas against the results from Minitab software and reproduced examples from the literature, and I checked the power analyses against simulations; all yielded the expected results, so that’s comforting. On the other hand, I had never heard of equivalence testing until 9 days ago (thanks 'Bum Deggy'), so that’s less comforting, I guess. Send me an email if you want to use these formulas for anything serious like a publication. If you find a mistake or misbehaving functions, let me know.

If you load (select and run) the functions (see GitHub gist below), you can perform a TOST by entering the correct numbers and running the single line of code:

TOSTd(d=0.13,n1=90,n2=90,eqbound_d=0.5)

You don’t know how to calculate Cohen’s d in an independent t-test? No problem. Use the means and standard deviations in each group instead, and type:

TOST(m1=0.26,m2=0.0,sd1=2,sd2=2,n1=90,n2=90,eqbound_d=0.5)

You’ll get the figure above, and it calculates Cohen’s d and the 95% CI around the effect size for free. You are welcome. Note that TOST and TOSTd differ slightly (TOST relies on the t-distribution, TOSTd on the z-distribution). If possible, use TOST – but TOSTd (and especially TOSTdpaired) will be very useful for readers of the scientific literature who want to quickly check the claim that there is a lack of an effect when means or standard deviations are not available. If you prefer to set the equivalence bounds in raw difference scores (e.g., 10% of the mean in the control condition, as is common in medicine), you can use the TOSTraw function.

Are you wondering if your design was well powered? Or do you want to design a study well-powered to reject a meaningful effect? No problem. For an alpha (Type 1 error rate) of 0.05, 80% power (or a beta or Type 2 error rate of 0.2), and a SESOI of 0.4, just type:

powerTOST(alpha=0.05, beta=0.2, eqbound_d=0.4) #Returns n (for each condition)

You will see that you need 107 participants in each condition to have 80% power to reject an effect larger than d = 0.4 and accept the null (or an effect smaller than your smallest effect size of interest). Note that this function is based on the z-distribution; it does not use the iterative approach based on the t-distribution that would make it exact, so it is an approximation, but it should work well enough in practice.
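If you are curious where a number like 107 comes from, a z-based approximation can be written down in a few lines (a sketch of one common approximation, assuming the true effect is exactly zero; the powerTOST function in the gist may differ in its details):

# Sketch of a z-based sample size approximation for TOST (per group), assuming
# the true effect is exactly zero; both one-sided tests then have to reject,
# which is where the beta/2 term comes from.
n_tost_sketch <- function(alpha, beta, eqbound_d) {
  2 * (qnorm(1 - alpha) + qnorm(1 - beta/2))^2 / eqbound_d^2
}

n_tost_sketch(alpha = 0.05, beta = 0.2, eqbound_d = 0.4)   # ~107 per group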

TOSTr will perform these calculations for correlations, and TOSTdpaired will allow you to use Cohen’s dz to perform these calculations for within designs. powerTOSTpaired can be used when designing within-subject studies that are well-powered to test whether the data are in line with the lack of a meaningful effect.

Choosing your SESOI

How should you choose your SESOI? Let me quote myself (Lakens, 2014, p. 707):
In applied research, practical limitations of the SESOI can often be determined on the basis of a cost–benefit analysis. For example, if an intervention costs more money than it saves, the effect size is too small to be of practical significance. In theoretical research, the SESOI might be determined by a theoretical model that is detailed enough to make falsifiable predictions about the hypothesized size of effects. Such theoretical models are rare, and therefore, researchers often state that they are interested in any effect size that is reliably different from zero. Even so, because you can only reliably examine an effect that your study is adequately powered to observe, researchers are always limited by the practical limitation of the number of participants that are willing to participate in their experiment or the number of observations they have the resources to collect.

Let’s say you collect 50 participants in each of two independent conditions, and plan to do a t-test with an alpha of 0.05. You have 80% power to detect an effect with a Cohen’s d of 0.57. To have 80% power to reject an effect of d = 0.57 or larger in TOST, you would need 66 participants in each condition.

Let’s say your SESOI is actually d = 0.35. To have 80% power in TOST you would need 169 participants in each condition (you’d need 130 participants in each condition to have 80% power to reject the null of d = 0 in NHST).

Conclusion

We see that you always need somewhat more participants to reject a meaningful effect than to reject the null hypothesis of no effect when the true effect is that meaningful effect. Remember that, since TOST can be performed based on Cohen’s d, you can use it in meta-analyses as well (Rogers, Howard, & Vessey, 1993). A meta-analysis is a great place to use EHT and reject a small effect (e.g., d = 0.2, or even d = 0.1), for which you need quite a lot of observations (i.e., 517, or even 2069).

Equivalence testing has many benefits. It fixes the dichotomous nature of NHST. You can now 1) reject the null, and fail to reject the null of equivalence (there is probably something, of the size you find meaningful), 2) reject the null, and reject the null of equivalence (there is something, but it is not large enough to be meaningful), 3) fail to reject the null, and reject the null of equivalence (the effect is smaller than anything you find meaningful), and 4) fail to reject the null, and fail to reject the null of equivalence (undetermined: you don’t have enough data to say there is an effect, and you don’t have enough data to say there is a lack of a meaningful effect). These four situations are visualized below.

There are several papers across scientific disciplines telling us to use equivalence testing. I’m definitely not the first. But in my experience, the trick to getting people to use better statistical approaches is to make them easy to use. I’ll work on a manuscript that tries to make these tests easy to use (if you have read this post this far and work for a journal that might be interested in this, drop me a line – I’ll throw in an easy-to-use spreadsheet just for you). Thinking about meaningful effects in terms of standardized effect sizes, and being able to perform these tests based on summary statistics, might just do the trick. Try it.








Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. http://doi.org/10.3389/fpsyg.2013.00863

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023

Rogers, J. L., Howard, K. I., & Vessey, J. T. (1993). Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin, 113(3), 553.

Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6), 657–680.