The 20% Statistician

A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, October 6, 2016

Improving Your Statistical Inferences Coursera course

I’m really excited to be able to announce my “Improving Your Statistical Inferences” Coursera course. It’s a free massive open online course (MOOC) consisting of 22 videos, 10 assignments, 7 weekly exams, and a final exam. All course materials are freely available, and you can start whenever you want.

In this course, I try to teach all the stuff I wish I had learned when I was a student. It ranges from the basics (e.g., how to interpret p-values, what likelihoods and Bayesian statistics are, how to control error rates or calculate effect sizes) to what I think should also be the basics (e.g., equivalence testing, the positive predictive value, sequential analyses, p-curve analysis, open science). The hands-on assignments will make sure you don’t just hear about these things, but know how to use them.

My hope is that busy scholars who want to learn about these things now have a convenient and efficient way to do so. I’ve taught many workshops, but there is only so much you can teach in one or two days. Moreover, most senior researchers don’t even have a single day to spare for education. When I teach PhD students about new methods, their supervisors often respond by saying ‘I've never heard of that, I don't think we need it’. It would be great if everyone had the time to watch some of my videos while ironing, chopping vegetables, or doing the dishes (these are the times I myself watch Coursera videos), and saw the need to change some research practices.

This content was tried out and developed over the last 4 years in lectures and workshops for hundreds of graduate students around the world – thank you all for your questions and feedback! Recording these videos was made possible by a grant from the 4TU Centre for Engineering Education, and they were filmed at the recording studio of the TU Eindhoven (if you need a great person to edit your videos, contact Tove Elfferich). The assignments were tested by Moritz Körber, Jill Jacobson, Hanne Melgård Watkins, and around 50 beta-testers who tried out the course in the last few weeks (special shout-out to Lilian Jans-Beken, the first person to complete the entire course!). I really enjoy seeing the positive feedback.

Tim van der Zee helped with creating exam questions, and Hanne Duisterwinkel at the TU Eindhoven helped with all formalities. Thanks so much to all of you for your help.

This course is brand new – if you follow it, feel free to send feedback and suggestions for improvement.

I hope you enjoy the course.

Sunday, September 18, 2016

Why scientific criticism sometimes needs to hurt

I think it was near the end of 2012 when my co-authors and I received an e-mail from Greg Francis pointing out that a study we had published on the relationship between physical weight and importance was ‘too good to be true’. This was a stressful event. We were extremely uncertain about what this meant, but we realized it couldn’t be good. For me, it was the first article I had ever published. What did we do wrong? How serious was this allegation? What did it imply about the original effect? How would this affect our reputation?

As a researcher who gets such severe criticism, you have to go through the 5 stages of grief. Denial (‘This doesn’t make any sense at all’), anger (‘Who is this asshole?’), negotiation (‘If he had taken into account this main effect, which was non-significant, our results wouldn’t be improbable!’), depression (‘What a disaster’), until, finally, you reach acceptance (‘OK, he has somewhat of a point’).

In keeping with the times, we had indeed performed multiple comparisons without correcting, and didn’t report one study that had not revealed a significant effect (which we immediately uploaded to PsychFileDrawer).

Before Greg Francis e-mailed us, I probably had heard about statistical power, and knew about publication bias, but receiving this personal criticism forced me to kick my understanding of these issues to a new level. I started to read about the topic, and quickly understood that you can’t have exclusively significant sets of studies in scientific articles, even when there is a true effect (see Schimmack, 2012, for a good explanation). Oh, it felt unfair to be singled out, when everyone else had a file drawer. We joked that we would from now on only submit one-study papers to avoid such criticism (the test for excessive significance can only be done on multiple-study papers). And we didn’t like the tone. “Too good to be true” sounds a lot like fraud, while publication bias sounds almost as inevitable as death and taxes.
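For readers who want to see the arithmetic behind Schimmack’s point, here is a minimal sketch in R; the power per study and the number of studies are purely illustrative assumptions, not the values from our paper.

# With a true effect but modest power, a paper in which every study
# is significant is improbable (Schimmack, 2012).
power_per_study <- 0.5   # assumed power of each individual study
n_studies <- 4           # assumed number of studies in the paper
# Probability that all studies are significant, assuming independence:
power_per_study^n_studies   # 0.0625, so only about 6% of such papers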

But now that some time has passed, I think about this event quite differently. I wonder where I would be without this criticism. I was already thinking about ‘Slow Science’, as we tended to call it in 2010, and had written about topics such as reward structures and the importance of replication research early in 2012. But if no one had told me explicitly and directly that I was doing things wrong, would I have been equally motivated to change the way I do science? I don’t think so. There is a difference between knowing something is important, and feeling something is important. I had had the opportunity to read about these topics for years, but all of a sudden, I actually was reading about them. Personal criticism was, at least for me, a strong motivating force.

I shouldn’t be surprised by this as a psychologist. I know there is the value-action gap (the difference between saying something is important, and acting based on those beliefs). It makes sense that it took slightly hurtful criticism for me to really be motivated to ignore current norms in my field, and take the time and effort to reflect on what I thought would be best practices.

I’m not saying that criticism has to be hurtful. Sometimes, people who criticize others can try to be a bit more nuanced when they tell the 2726th researcher who gets massive press attention based on a set of underpowered studies with all p-values between 0.03 and 0.05 that power is ‘pretty important’ and the observed results are ‘slightly unlikely’ (although I can understand they might sometimes be a bit too frustrated to use the most nuanced language possible). But I also don’t know how anyone could have brought the news that one of my most-cited papers was probably nothing more than a fluke in a way that would not have left me stressed, angry, and depressed, as a young untenured researcher who didn’t really understand the statistical problems well enough.

This week, a large-scale replication of one of the studies on the weight-importance effect was published. There was no effect. When I look at how my co-authors and I responded, I am grateful for having received the criticism from Greg Francis years before this large-scale replication was performed. Had a failure to replicate our work been the very first time I was forced to think about the strength of our original research, I fear I might have been one of those scholars who respond defensively to failures to replicate their work. We would likely have made it only to the ‘anger’ stage in the 5 steps towards acceptance. Without having had several years to improve our understanding of the statistical issues, we would likely have written a very different commentary. Instead, we simply responded by stating: “We have had to conclude that there is actually no reliable evidence for the effect.”

I wanted to share this for two reasons.

First, I understand the defensiveness in some researchers. Getting criticism is stressful, and reduces the pleasure in your work. You don’t want to spend time having to deal with these criticisms, or feel insecure about how well you are actually able to do good science. I’ve been there, and it sucks. Based on my pop-science understanding of the literature on grief processing, I’m willing to give you a month for every year that you have been in science to go through all 5 stages. After a forty-year career, be in denial for 8 months. Be angry for another 8. But after 3 years, I expect you’ll slowly start to accept things. Maybe you want to cooperate with a registered replication report about your own work. Or maybe, if you are still active as a researcher, you want to test some of the arguments you proposed while you were in denial or negotiating, in a pre-registered study.

The second reason I wanted to share this is much more important. As a scientific community, we are extremely ungrateful to people who express criticism. I think the way we treat people who criticize us is deeply shameful. I see people who suffer blatant social exclusion. I see people who don’t get the career options they deserve. I see people whose work is kept out of prestigious journals. Those who criticize us have nothing to gain, and everything to lose. If you can judge a society by how it treats its weakest members, psychologists don’t have a lot to be proud of in this area.

So here, I want to personally thank everyone who has taken the time to criticize my research or thoughts. I know for a fact that while it happened, I wasn’t even close to as grateful as I should have been. Even now, the eight weeks of meditation training I did two years ago will not be enough for me not to feel hurt when you criticize me. But in the long run, feel comforted that I am grateful for every criticism that forces me to have a better understanding of how to do the best science I can do.

Monday, July 18, 2016

Dance of the Bayes factors

You might have seen the ‘Dance of the p-values’ video by Geoff Cumming (if not, watch it here). Because p-values and the default Bayes factors (Rouder, Speckman, Sun, Morey, & Iverson, 2009) are both calculated directly from t-values and sample sizes, we might expect there is also a Dance of the Bayes factors. And indeed, there is. Bayes factors can vary widely over identical studies, just due to random variation.

If people always correctly interpreted Bayes factors, that would not be a problem. Bayes factors tell you how well the data are in line with two models, and quantify the relative evidence in favor of one of these models. The data are what they are, even when they are misleading (i.e., supporting a hypothesis that is not true). So, you can conclude the null model is more likely than some other model, but purely based on a Bayes factor, you can’t draw a conclusion such as “This Bayes factor allows us to conclude that there are no differences between conditions”. Regrettably, researchers are starting to misinterpret Bayes factors on a massive scale (I won't provide references, though I have many). This is not surprising – people find statistical inferences difficult, whether these are about p-values, confidence intervals, or Bayes factors.

As a consequence, we see many dichotomous absolute interpretations (“we conclude there is no effect”) instead of continuous relative interpretations (“we conclude the data increase our belief in the null model compared to the alternative model”). As a side note: in my experience, some people who advocate Bayesian statistics over NHST often live in a weird limbo. They believe the null is never true when they are criticizing Null-Hypothesis Significance Testing as a useless procedure (because we already know the null is not true), but they love using Bayes factors to conclude the null hypothesis is supported.

For me, there is one important difference between the dance of the p-values and the dance of the Bayes factors: when people draw dichotomous conclusions, p-values allow you to control your error rate in the long run, while error rates are ignored when people use Bayes factors. As a consequence, you can easily conclude there is ‘no effect’ when there is an effect, 25% of the time (see below). This is a direct consequence of the ‘Dance of the Bayes factors’.

Let’s take the following scenario: There is a true small effect, Cohen’s d = 0.3. You collect data and perform a default two-sided Bayesian t-test with 75 participants in each condition. Let’s repeat this 100,000 times, and plot the Bayes factors we can expect.
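Here is a minimal sketch of this simulation in R. I’m assuming the BayesFactor package with its default Cauchy prior (rscale = 0.707), and I use fewer repetitions than the 100,000 above so it runs in reasonable time; the exact percentages will therefore wiggle a bit.

# Dance of the Bayes factors: true d = 0.3, n = 75 per group,
# default two-sided Bayesian t-test (Rouder et al., 2009).
library(BayesFactor)

set.seed(123)
nSims <- 10000   # fewer than 100,000, for speed
d <- 0.3
n <- 75

bf10 <- replicate(nSims, {
  x <- rnorm(n, mean = 0, sd = 1)
  y <- rnorm(n, mean = d, sd = 1)
  extractBF(ttestBF(x = x, y = y, rscale = 0.707))$bf
})

# Proportions supporting the null, inconclusive, or supporting the alternative
mean(bf10 < 1/3)               # roughly 0.25: 'support' for the null
mean(bf10 >= 1/3 & bf10 <= 3)  # roughly 0.50: inconclusive
mean(bf10 > 3)                 # roughly 0.25: 'support' for the alternative

hist(log10(bf10), breaks = 50, xlab = "log10(BF10)",
     main = "Dance of the Bayes factors")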

If you like a more dynamic version, check the ‘Dance of the Bayes factors’ R script at the bottom of this post. As output, it gives you a :D smiley when you have strong evidence for the null (BF < 0.1), a :) smiley when you have moderate evidence for the null, a (._.) when the data are inconclusive, and a :( or :(( when the data strongly support the alternative (smileys are coded based on the assumption that researchers want to find support for the null). See the .gif below for the Dance of the Bayes factors if you don’t want to run the script.

I did not choose this example randomly (just as Geoff Cumming did not randomly choose to use 50% statistical power in his ‘Dance of the p-values’ video). In this situation, approximately 25% of Bayes factors are smaller than 1/3 (which can be interpreted as support for the null), 25% are higher than 3 (which can be interpreted as support for the alternative), and 50% are inconclusive. If you would conclude, based on your Bayes factor, that there are no differences between groups, you’d be wrong 25% of the time, in the long run. That’s a lot.

(You might feel more comfortable using a BF of 1/10 as a ‘strong evidence’ threshold: a BF < 0.1 happens 12.5% of the time in this simulation. A BF > 10 never happens: we don't have a large enough sample size. If the true effect size is 0.3, and you have decided to collect a maximum of 75 participants in each group while looking at the data repeatedly until you have ‘strong evidence’ (BF > 10 or BF < 0.1), you will never observe support for the alternative; you can only observe strong evidence in favor of the null model, even though there is a true effect.)
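To see that optional stopping scenario in action, here is a hedged sketch (again assuming the BayesFactor package; the first look at n = 10 per group and the step size of 5 are my own illustrative choices, not values from the post):

# Optional stopping: true d = 0.3, look at the data repeatedly,
# stop at BF10 > 10 or BF10 < 1/10, or at a maximum of 75 per group.
library(BayesFactor)

sequentialBF <- function(d = 0.3, nMax = 75, nStart = 10, step = 5) {
  x <- rnorm(nMax, mean = 0, sd = 1)
  y <- rnorm(nMax, mean = d, sd = 1)
  for (n in seq(nStart, nMax, by = step)) {
    bf10 <- extractBF(ttestBF(x = x[1:n], y = y[1:n], rscale = 0.707))$bf
    if (bf10 > 10)  return("strong evidence for H1")
    if (bf10 < 0.1) return("strong evidence for H0")
  }
  return("inconclusive at nMax")
}

set.seed(456)
table(replicate(1000, sequentialBF()))
# Expect many 'inconclusive' runs, some 'strong evidence for H0', and
# (almost) never 'strong evidence for H1', even though there is a true effect.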

Felix Schönbrodt gives some examples of the probability that you will observe a misleading Bayes factor for different effect sizes and priors (Schönbrodt, Wagenmakers, Zehetleitner, & Perugini, 2015). Here, I just want to note that you might want to take the Frequentist properties of Bayes factors into account if you want to make dichotomous conclusions such as ‘the data allow us to conclude there is no effect’. Just as the ‘Dance of the p-values’ can be turned into a ‘March of the p-values’ by increasing the statistical power, you can design studies that will yield informative Bayes factors most of the time (Schönbrodt & Wagenmakers, 2016). But you can only design informative studies, in the long run, if you take the Frequentist properties of tests into account. If you just look ‘at the data at hand’, your Bayes factors might be dancing around. You need to look at their Frequentist properties to design studies where Bayes factors march around. My main point in this blog is that this is something you might want to do.

What’s the alternative? First, never make incorrect dichotomous conclusions based on Bayes factors. I have the feeling I will be repeating this for the next 50 years. Bayes factors are relative evidence. If you want to make statements about how likely the null is, define a range of possible priors, use Bayes factors to update these priors, and report posterior probabilities as your explicit subjective belief in the null.
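As a small worked example of that last suggestion (the Bayes factor and the range of prior probabilities below are purely illustrative numbers):

# Updating a range of priors with a Bayes factor in favor of the null.
bf01 <- 3                        # illustrative observed BF for H0 over H1
priorH0 <- c(0.2, 0.5, 0.8)      # a range of possible prior beliefs in H0

priorOdds <- priorH0 / (1 - priorH0)
posteriorOdds <- priorOdds * bf01          # Bayes' rule on the odds scale
posteriorH0 <- posteriorOdds / (1 + posteriorOdds)
round(posteriorH0, 2)                      # 0.43 0.75 0.92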

Second, you might want to stay away from the default priors. Using default priors as a Bayesian is like eating a no-fat no-sugar no-salt chocolate-chip cookie: You might as well skip it. You will just get looks of sympathy as you try to swallow it down. Look at Jeff Rouder’s post on how to roll your own priors.

Third, if you just want to say the effect is smaller than anything you find worthwhile (without specifically concluding there is no effect), equivalence testing might be much more straightforward. It has error control, so you won’t incorrectly say the effect is smaller than anything you care about too often, in the long run.
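For completeness, here is a minimal sketch of such an equivalence test using two one-sided tests (TOST) with base R; the equivalence bounds of ±0.5 on the raw mean difference and the simulated data are illustrative assumptions, and the bounds should reflect the smallest effect you still care about.

# Equivalence test via two one-sided t-tests (TOST).
set.seed(789)
x <- rnorm(75, mean = 0, sd = 1)   # illustrative data
y <- rnorm(75, mean = 0, sd = 1)
lowBound <- -0.5                   # smallest raw difference you care about
highBound <- 0.5

# Is the difference reliably larger than the lower bound?
pLower <- t.test(x, y, mu = lowBound, alternative = "greater")$p.value
# Is the difference reliably smaller than the upper bound?
pUpper <- t.test(x, y, mu = highBound, alternative = "less")$p.value

# Statistically equivalent (at alpha = .05) if both one-sided tests are significant
max(pLower, pUpper) < .05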

The final alternative is just to ignore error rates. State loudly and clearly that you don’t care about Frequentist properties. Personally, I hope Bayesians will not choose this option. I would not be happy with a literature where thousands of articles claim the null is true, when there is a true effect. And you might want to know how to design studies that are likely to give answers you find informative.

When using Bayes factors, remember they can vary a lot across identical studies. Also remember that Bayes factors give you relative evidence. The null model might be more likely than the alternative, but both models can be wrong. If the true effect size is 0.3, the data might be closer to a value of 0 than to a value of 0.7, but it does not mean the true value is 0. In Bayesian statistics, the same reasoning holds. Your data may be more likely under a null model than under an alternative model, but that does not mean there are no differences. If you nevertheless want to argue that the null-hypothesis is true based on just a Bayes factor, realize you might be fooling yourself 25% of the time. Or more. Or less.


  • Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237.
  • Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2015). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods.
  • Schönbrodt, F. D., & Wagenmakers, E.-J. (2016). Bayes factor design analysis: Planning for compelling evidence (SSRN Scholarly Paper No. ID 2722435). Rochester, NY: Social Science Research Network.

Thursday, June 23, 2016

NWO pilot project will fund €3,000,000 worth of replication research

The Times Higher Education reports on two new initiatives in the Netherlands to bolster scientific standards. Here, I want to talk about one of these initiatives I was involved in: A fund for replication research. The board of the Dutch science funder NWO still has to officially approve the final call for proposals, but the draft text is basically done.

Now I’m very proud of the Dutch NWO for taking such a bold step and funding replication research. I’m also proud of the small role I played in this process. And because I think it’s a nice story of how you can sometimes instigate some change as a single individual, I thought I’d share it.

In 2012, Sander Koole and I published a paper on the importance of rewarding replication research. We decided we would not just write about this topic, but act. And thus we wrote to the Dutch science funder NWO. We explained how performing replication research is like eating your veggies (Brussels sprouts, to be precise): Very few people get enthusiastic about vegetables, but if you don’t eat them you’ll never grow big and strong. Because researchers have spent all their money on sweets, science now has cavities. We need to stimulate healthier behavior. And thus NWO should fund replication research. I love how we recommended that NWO should also include money for an open access publication, which ‘is currently the only place where replication studies are published’. How long ago can 2012 feel?

The response of NWO was classic. They wrote: “As long as a proposal contains innovative elements, and the results can contribute to the development of science, it fits within NWO calls. If the test of earlier results would, for example, be done through a new method, the proposal can successfully take part in the NWO competition.” (Our piece, and the reply by NWO, are available below, if you read Dutch.)

We can see they did not yet get the point (indeed, the replication grants that are now introduced are not available for conceptual replications, only for replication studies that use the same method).

I was a bit annoyed, to say it politely. And here’s an important life lesson I took away from this: When you are truly frustrated, don’t hold it in. I sent NWO an email. Let me quote myself: “Replication is the very foundation of a robust science. The current NWO policy undermines the foundation of science. If this is not dealt with, NWO will be doing more harm than good for science. There is no other solution than to adjust the current policy.”

Now, I had completed my PhD just two years earlier, and I fully expected NWO to ignore my opinion. But remarkably, they didn’t. Instead, they invited me over for a talk. A very nice talk. I explained the problems, and reminded NWO that they legally had two tasks: to stimulate novel research, and to improve the quality of research. This second task, I argued, could use some more attention. I proposed a possible solution (the current grants will be much larger than what I initially suggested), and everyone became enthusiastic about the idea of funding replication research.

Change takes time, but here we are, some four years later, with €3,000,000 for replication research (to be spread out over three years). Many people at NWO have worked very hard on making this possible, and I’m grateful for all their work. It’s been fun to be at the start of something as exciting as this bold step the Dutch science funder NWO is taking. I look forward to the cool projects researchers will do with these grants.