The 20% Statistician

A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Saturday, November 12, 2016

Why Within-Subject Designs Require Fewer Participants than Between-Subject Designs

One widely recommended approach to increase power is using a within-subject design. Indeed, you need fewer participants to detect a mean difference between two conditions in a within-subject design (in a dependent t-test) than in a between-subject design (in an independent t-test). The reason is straightforward, but not always explained, and even less often expressed in the easy equation below. The sample size needed in within-designs (NW) relative to the sample needed in between-designs (NB), assuming normal distributions, is (from Maxwell & Delaney, 2004, p. 561, formula 45):

NW = NB (1-ρ)/2

The “/2” part of the equation is due to the fact that in a two-condition within design every participant provides two data-points. The extent to which this reduces the sample size compared to a between-subject design depends on the correlation between the two dependent variables, as indicated by the (1-ρ) part of the equation. If the correlation is 0, a within-subject design simply needs half as many participants as a between-subject design (e.g., 64 instead of 128 participants). The higher the correlation, the larger the relative benefit of within designs; as the correlation becomes more negative the relative benefit shrinks, and it disappears entirely at a correlation of -1. Note that when the correlation is -1, you need 128 participants in a within-design and 128 participants in a between-design, but in a within-design you will need to collect two measurements from each participant, making a within design more work than a between-design. However, negative correlations between dependent variables in psychology are rare, and perfectly negative correlations will probably never occur.
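To make the equation concrete, here is a minimal Python sketch that simply transcribes it. The function name `within_n` is made up for this illustration, and the 128 participants are the example figure used above. In practice you would round the result up to a whole number of participants.

```python
def within_n(n_between, rho):
    """Participants needed in a two-condition within design.

    n_between: total participants needed in the between design (NB).
    rho: correlation between the two dependent measurements.
    Direct transcription of NW = NB (1 - rho) / 2.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie between -1 and 1")
    return n_between * (1.0 - rho) / 2.0


# Compare a few correlations for a between design that needs 128 participants.
for rho in (-1.0, -0.5, 0.0, 0.5, 0.7, 0.9):
    print(f"rho = {rho:5.2f} -> NW = {within_n(128, rho):6.1f}")
```

With a correlation of 0 this gives half of NB, and with a correlation of -1 it gives NB itself, matching the two cases described above.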

So what does the correlation do so that it increases the power of within designs, or reduces the number of participants you need? Let’s see what effect the correlation has on power by simulating and plotting correlated data. In the R script below, I’m simulating two measurements of IQ scores with a specific sample size (i.e., 10000), mean (i.e., 100 vs 106), standard deviation (i.e., 15), and correlation between the two measurements. The script generates three plots.

We will start with a simulation where the correlation between measurements is 0. First, we see the two normally distributed IQ measurements, with means of 100 and 106, and standard deviations of 15 (due to the large sample size, the numbers equal the input in the simulation, although small variation might still occur).


In the scatter plot, we can see that the correlation between the measurements is indeed 0.

Now, let’s look at the distribution of the difference scores. The mean difference is -6 (in line with the simulation settings), and the standard deviation is 21. This is also as expected. The standard deviation of the difference scores is √2 times as large as the standard deviation in each measurement, and indeed, 15*√2 = 21.21, which is rounded to 21. This situation where the correlation between measurements is zero equals the situation in an independent t-test, where the correlation between measurements is not taken into account.
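The value of 21 follows from the standard formula for the variance of a difference between two variables. As a short worked equation (the symbols σ₁, σ₂, and σ_diff are notation introduced here, not taken from the post):

```latex
\sigma_{\text{diff}}
  = \sqrt{\sigma_1^2 + \sigma_2^2 - 2\rho\,\sigma_1\sigma_2}
  = \sigma\sqrt{2(1-\rho)}
  \quad \text{when } \sigma_1 = \sigma_2 = \sigma .

\text{For } \sigma = 15,\ \rho = 0:\quad
\sigma_{\text{diff}} = 15\sqrt{2} \approx 21.21 .
```

The same expression shows why a positive correlation shrinks the standard deviation of the difference scores and a negative correlation inflates it.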

Now let’s increase the correlation between dependent variables to 0.7.

Nothing has changed when we plot the means:

The correlation between measurements is now strongly positive:

The important difference lies in the standard deviation of the difference scores. The SD = 11 instead of 21 in the simulation above. Because the standardized effect size is the difference divided by the standard deviation, the effect size (Cohen’s dz in within designs) is larger in this test than in the test above.
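As a rough illustration of how the correlation feeds into the effect size, the sketch below computes the standard deviation of the difference scores and Cohen's dz from the simulation settings (a mean difference of 6 in absolute value, SD = 15 in both measurements). The function names are made up for this example, and the numbers are population values, so they can differ slightly from the rounded values shown in the plots.

```python
import math


def sd_of_differences(sd, rho):
    """SD of difference scores when both measurements share the same SD."""
    return sd * math.sqrt(2.0 * (1.0 - rho))


def cohens_dz(mean_diff, sd, rho):
    """Cohen's dz: mean difference divided by the SD of the difference scores."""
    if rho >= 1.0:
        raise ValueError("dz is undefined when the correlation is exactly 1")
    return mean_diff / sd_of_differences(sd, rho)


# Same mean difference and SD, three different correlations.
for rho in (0.0, 0.7, 0.99):
    print(
        f"rho = {rho:4.2f}: "
        f"SD of differences = {sd_of_differences(15.0, rho):5.2f}, "
        f"dz = {cohens_dz(6.0, 15.0, rho):4.2f}"
    )
```

The raw mean difference never changes; only the denominator does, which is why the standardized effect grows with the correlation.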

We can make the correlation more extreme, by increasing the correlation to 0.99, after which the standard deviation of the difference scores is only 2.

If you run the R code below, you will see that if you set the correlation to a negative value, the standard deviation of the difference scores actually increases. 
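For readers who want to try this without R, here is a minimal Python sketch of the same kind of simulation. It is a stand-in written for illustration, not the original script: the function names, the seed, and the use of the standard library instead of plotting are all assumptions. It prints the observed correlation, mean difference, and SD of the difference scores rather than drawing the three plots.

```python
import math
import random
import statistics


def simulate_iq(n=10000, mean1=100.0, mean2=106.0, sd=15.0, rho=0.7, seed=1):
    """Draw n pairs of normally distributed scores with correlation rho."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x.append(mean1 + sd * z1)
        # Mixing z1 and z2 this way gives y a population correlation of rho with x.
        y.append(mean2 + sd * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))
    return x, y


def pearson_r(x, y):
    """Sample Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def summarize(rho, seed=1):
    """Simulate one data set and summarize the difference scores (x - y)."""
    x, y = simulate_iq(rho=rho, seed=seed)
    diff = [a - b for a, b in zip(x, y)]
    return {
        "r": pearson_r(x, y),
        "mean_diff": statistics.fmean(diff),
        "sd_diff": statistics.stdev(diff),
    }


# A negative correlation is included to show the SD of the differences increasing.
for rho in (0.0, 0.7, 0.99, -0.7):
    s = summarize(rho)
    print(
        f"rho = {rho:5.2f}: observed r = {s['r']:5.2f}, "
        f"mean difference = {s['mean_diff']:6.2f}, "
        f"SD of differences = {s['sd_diff']:5.2f}"
    )
```

Because the data are randomly generated, the printed numbers will vary a little around the settings, just as in the plots described above.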

I like to think of dependent variables in within-designs as dance partners. If they are well-coordinated (or highly correlated), one person steps to the left, and the other person steps to the left the same distance. If there is no coordination (or no correlation), when one dance partner steps to the left, the other dance partner is just as likely to move in the wrong direction as in the right direction. Such a dance couple will take up a lot more space on the dance floor.

You see that the correlation between dependent variables is an important aspect of within designs. I recommend explicitly reporting the correlation between dependent variables in within designs (e.g., participants responded significantly slower (M = 390, SD = 44) when they used their feet than when they used their hands (M = 371, SD = 44, r = .953), t(17) = 5.98, p < 0.001, Hedges' g = 0.43, Mdiff = 19, 95% CI [12; 26]). 
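One practical benefit of reporting the correlation is that readers can recompute the test from the summary statistics alone. The sketch below does this for the example above. It assumes n = 18 (inferred from the 17 degrees of freedom) and uses the rounded values as reported, so the result only matches up to rounding. The function name is made up for this illustration.

```python
import math


def paired_t_from_summary(mean_diff, sd1, sd2, r, n):
    """Recompute a dependent t-test from reported summary statistics.

    Returns the t-value, the degrees of freedom, and the SD of the
    difference scores implied by the two SDs and their correlation.
    """
    sd_diff = math.sqrt(sd1 ** 2 + sd2 ** 2 - 2.0 * r * sd1 * sd2)
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1, sd_diff


# Values from the reporting example: Mdiff = 19, both SDs = 44, r = .953.
t_value, df, sd_diff = paired_t_from_summary(19.0, 44.0, 44.0, 0.953, 18)
print(f"t({df}) = {t_value:.2f}, SD of differences = {sd_diff:.2f}")
```

Without the reported correlation, the SD of the difference scores, and therefore the t-value, could not be reconstructed from the means and standard deviations.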

Since most dependent variables in within designs in psychology are positively correlated, within designs will greatly increase the power you can achieve given the sample size you have available. Use within-designs when possible, but weigh the benefits of higher power against the downsides of order effects or carryover effects that might be problematic in a within-subject design. Maxwell and Delaney's book (Chapter 11) has a good discussion of this topic.

Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Thursday, October 6, 2016

Improving Your Statistical Inferences Coursera course

I’m really excited to be able to announce my “Improving Your Statistical Inferences” Coursera course. It’s a free massive open online course (MOOC) consisting of 22 videos, 10 assignments, 7 weekly exams, and a final exam. All course materials are freely available, and you can start whenever you want.

In this course, I try to teach all the stuff I wish I had learned when I was a student. It ranges from the basics (e.g., how to interpret p-values, what likelihoods and Bayesian statistics are, how to control error rates or calculate effect sizes) to what I think should also be the basics (e.g., equivalence testing, the positive predictive value, sequential analyses, p-curve analysis, open science). The hands-on assignments will make sure you don’t just hear about these things, but know how to use them.

My hope is that busy scholars who want to learn about these things now have a convenient and efficient way to do so. I’ve taught many workshops, but there is only so much you can teach in one or two days. Moreover, most senior researchers don’t even have a single day to spare for education. When I teach PhD students about new methods, their supervisors often respond by saying ‘I've never heard of that, I don't think we need it’. It would be great if everyone had the time to watch some of my videos while doing the ironing, chopping vegetables, or doing the dishes (these are the times I myself watch Coursera videos), and saw the need to change some research practices.

This content was tried out and developed over the last 4 years in lectures and workshops for hundreds of graduate students around the world – thank you all for your questions and feedback! Recording these videos was made possible by a grant from the 4TU Centre for Engineering Education at the recording studio of the TU Eindhoven (if you need a great person to edit your videos, contact Tove Elfferich). The assignments were tested by Moritz Körber, Jill Jacobson, Hanne Melgård Watkins, and around 50 beta-testers who tried out the course in the last few weeks (special shout-out to Lilian Jans-Beken, the first person to complete the entire course!). I really enjoy seeing the positive feedback.

Tim van der Zee helped with creating exam questions, and Hanne Duisterwinkel at the TU Eindhoven helped with all formalities. Thanks so much to all of you for your help.

This course is brand new – if you follow it, feel free to send feedback and suggestions for improvement.

I hope you enjoy the course.

Sunday, September 18, 2016

Why scientific criticism sometimes needs to hurt

I think it was sometime near the end of 2012 when my co-authors and I received an e-mail from Greg Francis pointing out that a study we published on the relationship between physical weight and importance was ‘too good to be true’. This was a stressful event. We were extremely uncertain about what this meant, but we realized it couldn’t be good. For me, it was the first article I had ever published. What did we do wrong? How serious was this allegation? What did it imply about the original effect? How would this affect our reputation?

As a researcher who gets such severe criticism, you have to go through the 5 stages of grief. Denial (‘This doesn’t make any sense at all’), anger (‘Who is this asshole?’), negotiation (‘If he would have taken into account this main effect which was non-significant, our results wouldn’t be improbable!’), depression (‘What a disaster’), until, finally, you reach acceptance (‘OK, he has somewhat of a point’).

In keeping with the times, we had indeed performed multiple comparisons without correcting, and didn’t report one study that had not revealed a significant effect (which we immediately uploaded to PsychFileDrawer).

Before Greg Francis e-mailed us, I probably had heard about statistical power, and knew about publication bias, but receiving this personal criticism forced me to kick my understanding about these issues to a new level. I started to read about the topic, and quickly understood that you can’t have exclusively significant sets of studies in scientific articles, even when there is a true effect (see Schimmack, 2012, for a good explanation). Oh, it felt unfair to be singled out, when everyone else had a file-drawer. We joked that we would from now on only submit one-study papers to avoid such criticism (the test for excessive significance can only be done on multiple study papers). And we didn’t like the tone. “Too good to be true” sounds a lot like fraud, while publication bias sounds almost as inevitable as death and taxes.

But now that some time has passed, I think about this event quite differently. I wonder where I would be without having had this criticism. I was already thinking about ‘Slow Science’ as we tended to call it in 2010, and had written about topics such as reward structures and the importance of replication research early in 2012. But if no-one had told me explicitly and directly that I was doing things wrong, would I have been equally motivated to change the way I do science? I don’t think so. There is a difference between knowing something is important, and feeling something is important. I had the opportunity to read about these topics for years, but all of a sudden, I actually was reading about these topics. Personal criticism was, at least for me, a strong motivating force.

I shouldn’t be surprised by this as a psychologist. I know there is the value-action gap (the difference between saying something is important, and acting based on those beliefs). It makes sense that it took slightly hurtful criticism for me to really be motivated to ignore current norms in my field, and take the time and effort to reflect on what I thought would be best practices.

I’m not saying that criticism has to be hurtful. Sometimes, people who criticize others can try to be a bit more nuanced when they tell the 2726th researcher who gets massive press attention based on a set of underpowered studies with all p-values between 0.03 and 0.05 that power is ‘pretty important’ and the observed results are ‘slightly unlikely’ (although I can understand they might be sometimes a bit too frustrated to use the most nuanced language possible). But I also don’t know how anyone could have brought the news that one of my most-cited papers was probably nothing more than a fluke in a way that I would not have felt stressed, angered, and depressed, as a young untenured researcher who didn’t really understand the statistical problems well enough.

This week, a large-scale replication of one of the studies on the weight-importance effect was published. There was no effect. When I look at how my co-authors and I responded, I am grateful for having received the criticism by Greg Francis years before this large-scale replication was performed. Had a failure to replicate our work been the very first time I had been forced to think about the strength of our original research, I fear I might have been one of those scholars who respond defensively to failures to replicate their work. It is likely that we would have only made it to the ‘anger’ stage in the 5 steps towards acceptance. Without having had several years to improve our understanding of the statistical issues, we would likely have written a very different commentary. Instead, we simply responded by stating: “We have had to conclude that there is actually no reliable evidence for the effect.”

I wanted to share this for two reasons.

First, I understand the defensiveness in some researchers. Getting criticism is stressful, and reduces the pleasure in your work. You don’t want to spend time having to deal with these criticisms, or feel insecure about how well you are actually able to do good science. I’ve been there, and it sucks. Based on my pop-science understanding of the literature on grief processing, I’m willing to give you a month for every year that you have been in science to go through all 5 stages. After a forty-year career, be in denial for 8 months. Be angry for another 8. But after 3 years, I expect you’ll slowly start to accept things. Maybe you want to cooperate with a registered replication report about your own work. Or maybe, if you are still active as a researcher, you want to test some of the arguments you proposed while you were in denial or negotiating, in a pre-registered study.

The second reason I wanted to share this is much more important. As a scientific community, we are extremely ungrateful to people who express criticism. I think the way we treat people who criticize us is deeply shameful. I see people who suffer blatant social exclusion. I see people who don’t get the career options they deserve. I see people whose work is kept out of prestigious journals. Those who criticize us have nothing to gain, and everything to lose. If you can judge a society by how it treats its weakest members, psychologists don’t have a lot to be proud of in this area.

So here, I want to personally thank everyone who has taken the time to criticize my research or thoughts. I know for a fact that while it happened, I wasn’t even close to as grateful as I should have been. Even now, the eight weeks of meditation training I did two years ago will not be enough for me not to feel hurt when you criticize me. But in the long run, feel comforted that I am grateful for every criticism that forces me to have a better understanding of how to do the best science I can do.