The 20% Statistician

A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Monday, July 18, 2016

Dance of the Bayes factors


You might have seen the ‘Dance of the p-values’ video by Geoff Cumming (if not, watch it here). Because p-values and the default Bayes factors (Rouder, Speckman, Sun, Morey, & Iverson, 2009) are both calculated directly from t-values and sample sizes, we might expect there is also a Dance of the Bayes factors. And indeed, there is. Bayes factors can vary widely over identical studies, just due to random variation.

If people always correctly interpreted Bayes factors, that would not be a problem. Bayes factors tell you how well the data are in line with models, and quantify the relative evidence in favor of one of these models. The data are what they are, even when they are misleading (i.e., supporting a hypothesis that is not true). So you can conclude the null model is more likely than some other model, but purely based on a Bayes factor you can’t draw a conclusion such as “This Bayes factor allows us to conclude that there are no differences between conditions”. Regrettably, researchers are starting to misinterpret Bayes factors on a massive scale (I won't provide references, though I have many). This is not surprising – people find statistical inferences difficult, whether these are about p-values, confidence intervals, or Bayes factors.

As a consequence, we see many dichotomous absolute interpretations (“we conclude there is no effect”) instead of continuous relative interpretations (“we conclude the data increase our belief in the null model compared to the alternative model”). As a side note: in my experience, some people who advocate Bayesian statistics over NHST live in a weird limbo. When they criticize Null-Hypothesis Significance Testing, they call it a useless procedure because we already know the null is never true, but they love using Bayes factors to conclude the null hypothesis is supported.

For me, there is one important difference between the dance of the p-values and the dance of the Bayes factors: when people draw dichotomous conclusions, p-values allow you to control your error rate in the long run, while error rates are ignored when people use Bayes factors. As a consequence, you can easily conclude there is ‘no effect’ when there is an effect, 25% of the time (see below). This is a direct consequence of the ‘Dance of the Bayes factors’.

Let’s take the following scenario: There is a true small effect, Cohen’s d = 0.3. You collect data and perform a default two-sided Bayesian t-test with 75 participants in each condition. Let’s repeat this 100,000 times, and plot the Bayes factors we can expect.
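To get a feel for the mechanics, here is a minimal sketch of such a simulation (an illustration, not the full ‘Dance of the Bayes factors’ script at the bottom of this post). It assumes the BayesFactor package, which computes the default Bayes factor directly from the t-value and the sample sizes; the exact proportions you get depend on the prior scale set via rscale.

# Minimal sketch (not the full script below): simulate default Bayes factors
# for a true effect of d = 0.3 with 75 participants per condition.
library(BayesFactor)

set.seed(42)
nsim <- 1e4                              # increase (e.g., to 1e5) for a smoother picture
bf <- replicate(nsim, {
  x <- rnorm(75, mean = 0.0, sd = 1)     # control condition
  y <- rnorm(75, mean = 0.3, sd = 1)     # experimental condition, true d = 0.3
  tval <- t.test(y, x, var.equal = TRUE)$statistic
  exp(ttest.tstat(t = tval, n1 = 75, n2 = 75, rscale = sqrt(2)/2)$bf)
})

hist(log10(bf), breaks = 100, xlab = "log10(BF10)",
     main = "Dance of the Bayes factors")
mean(bf < 1/3)              # proportion interpreted as support for the null
mean(bf > 3)                # proportion interpreted as support for the alternative
mean(bf >= 1/3 & bf <= 3)   # proportion inconclusive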



If you like a more dynamic version, check the ‘Dance of the Bayes factors’ R script at the bottom of this post. As output, it gives you a :D smiley when you have strong evidence for the null (BF < 0.1), a :) smiley when you have moderate evidence for the null, a (._.) when the data are inconclusive, and a :( or :(( when the data strongly support the alternative (smileys are coded based on the assumption that researchers want to find support for the null). See the .gif below for the Dance of the Bayes factors if you don’t want to run the script.

I did not choose this example randomly (just as Geoff Cumming did not randomly choose to use 50% statistical power in his ‘Dance of the p-values’ video). In this situation, approximately 25% of Bayes factors are smaller than 1/3 (which can be interpreted as support for the null), 25% are larger than 3 (which can be interpreted as support for the alternative), and 50% are inconclusive. If you conclude, based on your Bayes factor, that there are no differences between groups, you’d be wrong 25% of the time, in the long run. That’s a lot.

(You might feel more comfortable using a BF of 1/10 as a ‘strong evidence’ threshold: BF < 0.1 happens 12.5% of the time in this simulation. A BF > 10 never happens: we don't have a large enough sample size. If your true effect size is 0.3, you have decided to collect a maximum of 75 participants in each group, and you look at the data repeatedly until you have ‘strong evidence’ (BF > 10 or BF < 0.1), you will never observe support for the alternative: you can only observe strong evidence in favor of the null model, even though there is a true effect.)
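To see what such repeated looking does in practice, here is a minimal sketch of this kind of optional stopping (again an illustration assuming the BayesFactor package, not the script at the bottom of this post): peek at the Bayes factor every 5 participants per group, and stop at ‘strong evidence’ or at the maximum sample size.

# Minimal sketch: optional stopping on a default Bayes factor, peeking every
# 5 participants per group up to the maximum of 75 per group.
library(BayesFactor)

set.seed(1)
x <- rnorm(75, mean = 0.0, sd = 1)     # control condition
y <- rnorm(75, mean = 0.3, sd = 1)     # experimental condition, true d = 0.3
for (n in seq(10, 75, by = 5)) {
  tval <- t.test(y[1:n], x[1:n], var.equal = TRUE)$statistic
  bf10 <- exp(ttest.tstat(t = tval, n1 = n, n2 = n, rscale = sqrt(2)/2)$bf)
  cat(sprintf("n per group = %2d, BF10 = %6.2f\n", n, bf10))
  if (bf10 > 10 || bf10 < 1/10) break  # stop at the 'strong evidence' threshold
}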

Felix Schönbrodt gives some examples of the probability that you will observe a misleading Bayes factor for different effect sizes and priors (Schönbrodt, Wagenmakers, Zehetleitner, & Perugini, 2015). Here, I just want to note that you might want to take the Frequentist properties of Bayes factors into account if you want to draw dichotomous conclusions such as ‘the data allow us to conclude there is no effect’. Just as the ‘Dance of the p-values’ can be turned into a ‘March of the p-values’ by increasing the statistical power, you can design studies that will yield informative Bayes factors most of the time (Schönbrodt & Wagenmakers, 2016). But you can only design informative studies, in the long run, if you take the Frequentist properties of tests into account. If you just look ‘at the data at hand’, your Bayes factors might be dancing around. You need to look at their Frequentist properties to design studies where Bayes factors march around. My main point in this blog is that this is something you might want to do.

What’s the alternative? First, never draw incorrect dichotomous conclusions based on Bayes factors. I have the feeling I will be repeating this for the next 50 years. Bayes factors are relative evidence. If you want to make statements about how likely the null is, define a range of possible priors, use Bayes factors to update these priors, and report posterior probabilities as your explicit subjective belief in the null.

Second, you might want to stay away from the default priors. Using default priors as a Bayesian is like eating a no-fat no-sugar no-salt chocolate-chip cookie: You might as well skip it. You will just get looks of sympathy as you try to swallow it down. Look at Jeff Rouder’s post on how to roll your own priors.

Third, if you just want to say the effect is smaller than anything you find worthwhile (without specifically concluding there is no effect), equivalence testing might be much more straightforward. It has error control, so in the long run you won’t too often incorrectly say the effect is smaller than anything you care about.

The final alternative is just to ignore error rates. State loudly and clearly that you don’t care about Frequentist properties. Personally, I hope Bayesians will not choose this option. I would not be happy with a literature where thousands of articles claim the null is true, when there is a true effect. And you might want to know how to design studies that are likely to give answers you find informative.

When using Bayes factors, remember they can vary a lot across identical studies. Also remember that Bayes factors give you relative evidence. The null model might be more likely than the alternative, but both models can be wrong. If the true effect size is 0.3, the data might be closer to a value of 0 than to a value of 0.7, but it does not mean the true value is 0. In Bayesian statistics, the same reasoning holds. Your data may be more likely under a null model than under an alternative model, but that does not mean there are no differences. If you nevertheless want to argue that the null-hypothesis is true based on just a Bayes factor, realize you might be fooling yourself 25% of the time. Or more. Or less.


References

  • Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. http://doi.org/10.3758/PBR.16.2.225
  • Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2015). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/26651986
  • Schönbrodt, F. D., & Wagenmakers, E.-J. (2016). Bayes factor design analysis: Planning for compelling evidence (SSRN Scholarly Paper No. ID 2722435). Rochester, NY: Social Science Research Network. Retrieved from http://papers.ssrn.com/abstract=2722435



Thursday, June 23, 2016

NWO pilot project will fund €3,000,000 worth of replication research


The Times Higher Education reports on two new initiatives in the Netherlands to bolster scientific standards. Here, I want to talk about one of these initiatives I was involved in: A fund for replication research. The board of the Dutch science funder NWO still has to officially approve the final call for proposals, but the draft text is basically done.

Now I’m very proud of the Dutch NWO for taking such a bold step and funding replication research. I’m also proud of the small role I played in this process. And because I think it’s a nice story of how you can sometimes instigate some change as a single individual, I thought I’d share it.

In 2012, Sander Koole and I published a paper on the importance of rewarding replication research. We decided we would not just write about this topic, but act. And thus we wrote to the Dutch science funder NWO. We explained how performing replication research is like eating your veggies (Brussels sprouts, to be precise): Very few people get enthusiastic about vegetables, but if you don’t eat them you’ll never grow big and strong. Because researchers have spent all their money on sweets, science now has cavities. We need to stimulate healthier behavior. And thus NWO should fund replication research. I love how we recommended that NWO should also include money for an open access publication, which ‘is currently the only place where replication studies are published’. How long ago can 2012 feel?

The response of NWO was classic. They wrote: “As long as a proposal contains innovative elements, and the results can contribute to the development of science, it fits within NWO calls. If the test of earlier results would for example be done through a new method, the proposal can successfully take part in the NWO competition.” (Our piece, and the reply by NWO, are available below, if you read Dutch).

We can see they did not yet get the point (indeed, the replication grants that are now introduced are not available for conceptual replications, only for replication studies that use the same method).

I was a bit annoyed, to put it politely. And here’s an important life lesson I took away from this: When you are truly frustrated, don’t hold it in. I sent NWO an email. Let me quote myself: “Replication is the very foundation of a robust science. The current NWO policy undermines the foundation of science. If this is not dealt with, NWO will be doing more bad than good for science. There is no other solution than to adjust the current policy.”

Now, I had completed my PhD only two years earlier, and I fully expected NWO to ignore my opinion. But remarkably, they didn’t. Instead, they invited me over for a talk. A very nice talk. I explained the problems, and reminded NWO that they legally have two tasks: Stimulate novel research, and improve the quality of research. This second task, I argued, could use some more attention. I proposed a possible solution (the current grants will be much larger than what I initially suggested), and everyone became enthusiastic about the idea of funding replication research.

Change takes time, but here we are, some four years later, with €3,000,000 for replication research (to be spread out over three years). Many people at NWO have worked very hard on making this possible, and I’m grateful for all their work. It’s been fun to be at the start of something as exciting as this bold step the Dutch science funder NWO is taking. I look forward to the cool projects researchers will do with these grants.


Friday, May 20, 2016

Absence of evidence is not evidence of absence: Testing for equivalence

When you find p > 0.05, you did not observe surprising data, assuming there is no true effect. In the literature, p > 0.05 is often interpreted as ‘no effect’, but due to a lack of power the data might not have been surprising even if there were an effect. In this blog I’ll explain how to test for equivalence, or the lack of a meaningful effect, using equivalence hypothesis testing. I’ve created easy-to-use R functions that allow you to perform equivalence hypothesis tests. Warning: If you read beyond this paragraph, you will never again be able to write “as predicted, the interaction revealed there was an effect for participants in the experimental condition (p < 0.05) but there was no effect in the control condition (F < 1).” If you prefer the veil of ignorance, here’s a nice site with cute baby animals to spend the next 9 minutes on instead.

Any science that wants to be taken seriously needs to be able to provide support for the null hypothesis. I often see people switch from Frequentist statistics when effects are significant to Bayes factors when they want to provide support for the null hypothesis. But it is possible to test for the lack of an effect using p-values (why no one ever told me this in the 11 years I worked in science is beyond me). It’s as easy as doing a t-test, or more precisely, as doing two t-tests.

The practice of Equivalence Hypothesis Testing (EHT) is used in medicine, for example to test whether a new cheaper drug isn’t worse (or better) than the existing more expensive option. A very simple EHT approach is the ‘two-one-sided t-tests’ (TOST) procedure (Schuirmann, 1987). Its simplicity makes it wonderfully easy to use.

The basic idea of the test is to flip things around: In Equivalence Hypothesis Testing, the null hypothesis is that there is a true effect larger than a Smallest Effect Size of Interest (SESOI; Lakens, 2014). That’s right – the null hypothesis is now that there IS an effect, and we are going to try to reject it (with p < 0.05). The alternative hypothesis is that the effect is smaller than the SESOI, anywhere in the equivalence range – any effect you think is too small to matter, or too small to feasibly examine. For example, a Cohen’s d of 0.5 is a medium effect, so you might set d = 0.5 as your SESOI, and the equivalence range then goes from d = -0.5 to d = 0.5. In the TOST procedure, you first decide upon your SESOI: anything smaller than your smallest effect size of interest is considered smaller than small, and will allow you to reject the null hypothesis that there is a true effect. You then perform two t-tests, one testing whether the effect is smaller than the upper bound of the equivalence range, and one testing whether the effect is larger than the lower bound of the equivalence range. Yes, it’s that simple.

Let’s visualize this. Below on the left axis is a scale for the effect size measure Cohen’s d. On the left is a single 90% confidence interval (the crossed circles indicate the endpoints of the 95% confidence interval) with an effect size of d = 0.13. On the right is the equivalence range. It is centered on 0, and ranges from d = -0.5 to d = 0.5.

We see from the 95% confidence interval around d = 0.13 (again, the endpoints of which are indicated by the crossed circles) that the lower bound overlaps with 0. This means the effect (d = 0.13, from an independent t-test with two conditions of 90 participants each) is not statistically different from 0 at an alpha of 5%, and the p-value of the normal NHST is 0.384 (the title provides the exact numbers for the 95% CI around the effect size). But is this effect statistically smaller than my smallest effect size of interest?

Rejecting the presence of a meaningful effect

There are two ways to test for the lack of a meaningful effect, and they yield the same result. The first is to perform two one-sided t-tests, testing the observed effect size against the ‘null’ of your SESOI (0.5 and -0.5). These t-tests show that d = 0.13 is significantly larger than d = -0.5, and significantly smaller than d = 0.5. The higher of these two p-values is p = 0.007. We can conclude that there is support for the lack of a meaningful effect (where ‘meaningful’ is defined by your choice of the SESOI). The second approach (which is easier to visualize) is to calculate a 90% CI around the effect (indicated by the vertical line in the figure), and check whether this 90% CI falls completely within the equivalence range. You can see a line from the upper and lower limits of the 90% CI around d = 0.13 on the left to the equivalence range on the right, and the 90% CI is completely contained within the equivalence range. This means we can reject the null hypothesis of an effect larger than d = 0.5 or smaller than d = -0.5, and conclude the effect is smaller than what we find meaningful (and you’ll be right 95% of the time, in the long run).
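If you want to check these numbers yourself, here is a minimal sketch of the calculation (a simple z-approximation based on Cohen’s d and the sample sizes; it is not one of the TOST functions introduced below, and it ignores the small-sample correction a t-based test would add):

# Minimal sketch (z-approximation): two one-sided tests of an observed Cohen's d
# against the equivalence bounds, plus the 90% CI, for d = 0.13 with n = 90 per condition.
tost_d_sketch <- function(d, n1, n2, eqbound_d, alpha = 0.05) {
  se    <- sqrt(1/n1 + 1/n2)                  # approximate standard error of d
  p_low <- 1 - pnorm((d + eqbound_d) / se)    # H0: effect <= -eqbound_d
  p_up  <- pnorm((d - eqbound_d) / se)        # H0: effect >= +eqbound_d
  ci90  <- d + c(-1, 1) * qnorm(1 - alpha) * se
  list(p_tost = max(p_low, p_up), ci90 = ci90)
}

tost_d_sketch(d = 0.13, n1 = 90, n2 = 90, eqbound_d = 0.5)
# The larger of the two one-sided p-values is about 0.007, and the 90% CI
# (roughly -0.12 to 0.38) falls entirely inside the equivalence range.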

[Technical note: The reason we use a 90% confidence interval, and not a 95% confidence interval, is that the two one-sided tests are completely dependent. You could actually just perform one test: if the effect size is positive, against the upper bound of the equivalence range, and if the effect size is negative, against the lower bound of the equivalence range. If this one test is significant, so is the other. Therefore, we can use a 90% confidence interval, even though we perform two one-sided tests. This is also why the crossed circles are used to mark the 95% CI.]

So why haven’t we been using these tests in the psychological literature? It’s the same old, same old. Your statistics teacher didn’t tell you about them. SPSS doesn’t allow you to do an equivalence test. Your editors and reviewers were always OK with statements such as “as predicted, the interaction revealed there was an effect for participants in the experimental condition (p < 0.05) but there was no effect in the control condition (F < 1).” Well, I just ruined that for you. Absence of evidence is not evidence of absence!

We can’t use p > 0.05 as evidence for the lack of an effect. You can switch to Bayesian statistics if you want to support the null, but the default priors are only useful in research areas where very large effects are examined (e.g., some areas of cognitive psychology), and are not appropriate for most other areas in psychology, so you will have to be able to quantify your prior belief yourself. You can teach yourself how, but there might be researchers who prefer to provide support for the lack of an effect within a Frequentist framework. Given that most people think about the effect size they expect when designing their study, defining the SESOI at that moment is straightforward. After choosing the SESOI, you can even design your study to have sufficient power to reject the presence of a meaningful effect. Controlling your error rates is thus straightforward in equivalence hypothesis tests, while it is not that easy in Bayesian statistics (although it can be done through simulations).

One thing I noticed while reading this literature is that TOST procedures, and power analyses for TOST, are not created to match the way psychologists design studies and think about meaningful effects. In medicine, equivalence is defined on the raw scale (e.g., a decrease of 10% compared to the default medicine), while we are more used to thinking in terms of standardized effect sizes (correlations or Cohen’s d). Biostatisticians are fine with estimating the pooled standard deviation for a future study when performing a power analysis for TOST, but psychologists use standardized effect sizes to perform power analyses. Furthermore, the packages that exist in R (e.g., equivalence) and the software that performs equivalence hypothesis tests (e.g., Minitab, which has TOST for t-tests, but not for correlations) require the raw data. In my experience (Lakens, 2013), researchers find it easier to use their own preferred software to handle their data, and then calculate additional statistics not provided by that software by typing summary statistics (means, standard deviations, and sample sizes per condition) into a spreadsheet. So my functions don’t require access to the raw data (which is good for reviewers as well). Finally, the functions make a nice picture such as the one above, so you can see what you are doing.

R Functions

I created R functions for TOST for independent t-tests, paired-samples t-tests, and correlations, where you can set the equivalence thresholds using Cohen’s d, Cohen’s dz, and r. I adapted the equation for the power analysis to be based on d, and I created the equation for the power analysis for a paired-samples t-test from scratch, because I couldn’t find it in the literature. If it is not obvious: None of this is peer-reviewed (yet), and you should use it at your own risk. I checked the independent and paired t-test formulas against the results from Minitab software and reproduced examples from the literature, and I checked the power analyses against simulations; all yielded the expected results, so that’s comforting. On the other hand, I had never heard of equivalence testing until 9 days ago (thanks 'Bum Deggy'), so that’s less comforting, I guess. Send me an email if you want to use these formulas for anything serious like a publication. If you find a mistake or misbehaving functions, let me know.

If you load (select and run) the functions (see GitHub gist below), you can perform a TOST by entering the correct numbers and running the single line of code:

TOSTd(d=0.13,n1=90,n2=90,eqbound_d=0.5)

You don’t know how to calculate Cohen’s d in an independent t-test? No problem. Use the means and standard deviations in each group instead, and type:

TOST(m1=0.26,m2=0.0,sd1=2,sd2=2,n1=90,n2=90,eqbound_d=0.5)

You’ll get the figure above, and it calculates Cohen’s d and the 95% CI around the effect size for free. You are welcome. Note that TOST and TOSTd differ slightly (TOST relies on the t-distribution, TOSTd on the z-distribution). If possible, use TOST – but TOSTd (and especially TOSTdpaired) will be very useful for readers of the scientific literature who want to quickly check the claim that there is a lack of an effect when means or standard deviations are not available. If you prefer to set the equivalence bounds in raw difference scores (e.g., 10% of the mean in the control condition, as is common in medicine), you can use the TOSTraw function.

Are you wondering if your design was well powered? Or do you want to design a study well-powered to reject a meaningful effect? No problem. For an alpha (Type 1 error rate) of 0.05, 80% power (or a beta or Type 2 error rate of 0.2), and a SESOI of 0.4, just type:

powerTOST(alpha=0.05, beta=0.2, eqbound_d=0.4) #Returns n (for each condition)

You will see that you need 107 participants in each condition to have 80% power to reject an effect larger than d = 0.4 and accept the null (or an effect smaller than your smallest effect size of interest). Note that this function is based on the z-distribution; it does not use the iterative approach based on the t-distribution that would make it exact, so it is an approximation, but it should work well enough in practice.
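If you are curious where a number like 107 comes from, a z-based approximation can be written down in a few lines (a sketch of one common approximation, assuming the true effect is exactly zero; the powerTOST function in the gist may differ in its details):

# Sketch of a z-based sample size approximation for TOST (per group), assuming
# the true effect is exactly zero; both one-sided tests then have to reject,
# which is where the beta/2 term comes from.
n_tost_sketch <- function(alpha, beta, eqbound_d) {
  2 * (qnorm(1 - alpha) + qnorm(1 - beta/2))^2 / eqbound_d^2
}

n_tost_sketch(alpha = 0.05, beta = 0.2, eqbound_d = 0.4)   # ~107 per group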

TOSTr will perform these calculations for correlations, and TOSTdpaired will allow you to use Cohen’s dz to perform these calculations for within designs. powerTOSTpaired can be used when designing within-subject studies that are well-powered to test whether the data are in line with the lack of a meaningful effect.

Choosing your SESOI

How should you choose your SESOI? Let me quote myself (Lakens, 2014, p. 707):
In applied research, practical limitations of the SESOI can often be determined on the basis of a cost–benefit analysis. For example, if an intervention costs more money than it saves, the effect size is too small to be of practical significance. In theoretical research, the SESOI might be determined by a theoretical model that is detailed enough to make falsifiable predictions about the hypothesized size of effects. Such theoretical models are rare, and therefore, researchers often state that they are interested in any effect size that is reliably different from zero. Even so, because you can only reliably examine an effect that your study is adequately powered to observe, researchers are always limited by the practical limitation of the number of participants that are willing to participate in their experiment or the number of observations they have the resources to collect.

Let’s say you collect 50 participants in each of two independent conditions, and plan to do a t-test with an alpha of 0.05. You have 80% power to detect an effect with a Cohen’s d of 0.57. To have 80% power to reject an effect of d = 0.57 or larger in TOST, you would need 66 participants in each condition.

Let’s say your SESOI is actually d = 0.35. To have 80% power in TOST you would need 169 participants in each condition (you’d need 130 participants in each condition to have 80% power to reject the null of d = 0 in NHST).

Conclusion

We see that you always need somewhat more participants to reject a meaningful effect than to reject the null hypothesis of no effect when the true effect is that meaningful effect. Remember that, since TOST can be performed based on Cohen’s d, you can use it in meta-analyses as well (Rogers, Howard, & Vessey, 1993). A meta-analysis is a great place to use EHT and reject a small effect (e.g., d = 0.2, or even d = 0.1), for which you need quite a lot of observations (i.e., 517, or even 2069).

Equivalence testing has many benefits. It fixes the dichotomous nature of NHST. You can now 1) reject the null, and fail to reject the null of equivalence (there is probably something, of the size you find meaningful), 2) reject the null, and reject the null of equivalence (there is something, but it is not large enough to be meaningful), 3) fail to reject the null, and reject the null of equivalence (the effect is smaller than anything you find meaningful), and 4) fail to reject the null, and fail to reject the null of equivalence (undetermined: you don’t have enough data to say there is an effect, and you don’t have enough data to say there is a lack of a meaningful effect). These four situations are visualized below.

There are several papers across scientific disciplines telling us to use equivalence testing. I’m definitely not the first. But in my experience, the trick to getting people to use better statistical approaches is to make them easy to use. I’ll work on a manuscript that tries to make these tests easy to use (if you have read this post this far and work for a journal that might be interested in this, drop me a line – I’ll throw in an easy-to-use spreadsheet just for you). Thinking about meaningful effects in terms of standardized effect sizes, and being able to perform these tests based on summary statistics, might just do the trick. Try it.








Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. http://doi.org/10.3389/fpsyg.2013.00863

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023

Rogers, J. L., Howard, K. I., & Vessey, J. T. (1993). Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin, 113(3), 553.

Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6), 657–680.