Two recent articles highlight one of the adverse consequences of our fixation on p<0.05. It became common during the 20th century to state that if a statistical test shows your results are unlikely to be due to chance alone, at a level of less than 1 time in 20, then they are significant. The scientific community could just as easily have settled on 1 chance in 18 or 1 chance in 22, but 1 in 20 is a nice round number in a base-10 number system. (If we had evolved to have 11 fingers it would probably be 1 in 22!)
Anyway, when you spend a lot of time and energy on a research project, and you think your data are great, but the statistical test comes out as p=0.058, what do you do? There is an often unconscious feeling that maybe there was an outlier you should eliminate, maybe the data should be analyzed with a different test, or maybe they should be transformed before testing. Lo and behold, after several tries at fiddling with the data, the p value is 0.048 and you can use the hallowed word “significant”.
A recently published study provides some evidence that this actually happens (Masicampo EJ, Lalande DR: A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology 2012:1-9). The authors examined the main p values reported in psychology journals, and what they found was something that shouldn’t happen: p values just slightly below 0.05 were more common than they should be.
In their figure, that little circle just below .05 is way higher than it should be: people have been fiddling with their results! Imagine if we did have 11 fingers; a critical p value of .045 would make all those results non-significant.
Another new publication has suggested a way to detect this (Gadbury GL, Allison DB: Inappropriate fiddling with statistical analyses to obtain a desirable p-value: Tests to detect its presence in published literature. PLoS ONE 2012, 7(10):e46363). The authors reckoned that if people are fiddling, then there should be fewer p values than expected just above 0.05, because those will have been “adjusted” downwards. So they developed methods to compare the proportions of p values in different ranges, just above 0.05 and just below 0.1. Unfortunately their method only works on large collections of p values, and can’t be used on an individual study.
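The flavour of this kind of check can be sketched in a few lines of Python. This is my own toy illustration of the general idea, not the actual tests from the Gadbury and Allison paper: if the distribution of p values is smooth, then two narrow bins on either side of 0.05 should hold roughly equal counts, so a large excess in the lower bin is suspicious. The function names, bin width, and sample p values below are all invented for illustration.

```python
import math


def binom_sf(k, n, p=0.5):
    # P(X >= k) for X ~ Binomial(n, p), computed exactly
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def excess_below_threshold(p_values, threshold=0.05, width=0.005):
    """Count p values in narrow bins just below and just above the
    threshold. Under a smooth p-value distribution the two counts
    should be about equal (each bin equally likely), so we test the
    split against a fair coin with a one-sided binomial test."""
    below = sum(1 for p in p_values if threshold - width <= p < threshold)
    above = sum(1 for p in p_values if threshold < p <= threshold + width)
    n = below + above
    if n == 0:
        return below, above, 1.0
    return below, above, binom_sf(below, n)


# made-up p values, skewed towards "just significant"
sample = [0.048, 0.049, 0.047, 0.046, 0.052, 0.049, 0.044, 0.051]
b, a, pval = excess_below_threshold(sample)
```

With these invented numbers there are 5 p values in the bin just below 0.05 and only 2 just above, but the binomial test is nowhere near convincing on 7 observations, which is exactly why the real method needs a large collection of p values.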
The main reason this sort of thing happens, and distorts the medical literature, is the bias towards publishing significant results, and the relative difficulty of publishing results when the p value is a bit too big. It was suggested years ago (Gardner MJ, Altman DG: Confidence intervals rather than p values: Estimation rather than hypothesis testing. Brit Med J 1986, 292:746-750, among others) that we should stop publishing p values, and instead publish estimates of effect and confidence intervals. I think we could take this further. Journals should review only the methods of a paper; if the methodology is sound, they could commit to publishing, and only afterwards see the results and discussion, for editing and formatting, and to ensure that the discussion is appropriate. They could demand that authors present estimates of effect with confidence intervals, and that any calculation of the statistical likelihood that the results are due to chance alone (significance testing) be presented without reference to any particular threshold.
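To make the estimation approach concrete, here is a minimal sketch of reporting an effect estimate with a confidence interval rather than a significance verdict. This is my own illustration, not anything from the Gardner and Altman paper: it uses a two-sample difference in means with a normal approximation (z=1.96 for roughly 95% coverage); for small samples a t quantile would be more appropriate.

```python
import math


def mean_diff_ci(group_a, group_b, z=1.96):
    """Point estimate and approximate 95% confidence interval for the
    difference in means between two independent groups (Welch-style
    standard error, normal approximation)."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    # unbiased sample variances
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)
    diff = ma - mb
    return diff, (diff - z * se, diff + z * se)


# illustration with made-up measurements
diff, (lo, hi) = mean_diff_ci([5, 6, 7, 8, 9], [1, 2, 3, 4, 5])
```

The point is the reporting style: “difference 4.0 (95% CI 2.0 to 6.0)” tells the reader both the size of the effect and the precision of the estimate, where “p<0.05” tells them neither.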