To p or not to p, that is the question.

I can’t claim credit for this title, although I wish I could. I copied it from an article published in an ENT journal (Buchinsky FJ, Chadha NK. To P or Not to P: Backing Bayesian Statistics. Otolaryngol Head Neck Surg. 2017;157(6):915-8).

I think the word “significant” should be banned. (Not in life; I am not a fascist; you can say whatever you want is significant, but in medical research there is so much confusion about the term that we would be better off never using it!)

I think authors who find a potentially positive result in a good quality study should be allowed to say things like “if there were no other unanticipated biases in our research design, and if chance alone were at work, results at least as extreme as ours would occur less than 1 time in 20”, which is less sexy, but more accurate, than saying “our results were significant”. (If there are any real statisticians out there reading this, and I say anything which is not accurate, please let me know; I only have basic statistical training and would be happy to be corrected!)
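Just to make that “1 in 20” concrete, here is a minimal Python sketch with entirely invented numbers (a hypothetical 30/100 versus 18/100 outcome; nothing here comes from any real trial), showing the calculation that such a sentence would be describing:

```python
# A minimal sketch (invented numbers, not from any real study):
# compare a binary outcome between two groups and compute a p-value,
# i.e. the probability of data at least this extreme if chance alone
# (and every other model assumption) explained the difference.
from scipy.stats import fisher_exact

# Hypothetical trial: 30/100 events with intervention, 18/100 with control
table = [[30, 70],
         [18, 82]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
# If p < 0.05, all we can honestly say is: assuming no bias and a correct
# model, results at least this extreme would occur by chance alone
# less than 1 time in 20.
```

With these made-up counts the p-value hovers right around the conventional threshold, which is exactly the situation where binary “significant / not significant” language does the most damage.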

It would certainly be much better than assuming that p<0.05 means that you definitely found an effect, or that p>0.05 means that there is nothing there!

In this blog I usually try to avoid the term “statistically significant” (or not), as the term is often used to imply “proven effect” as opposed to “proof of no effect”. I hope we all know that a threshold where p=0.051 means no effect and p=0.049 means a proven effect is nonsense. Some journals have banned the reporting of p-values, and even confidence intervals, as a result. I think this is extreme; we should be able to report confidence intervals, but perhaps multiple confidence intervals (90%, 95%, and 99%) should be demanded, along with appropriate wording similar to what I suggested above. The risk is that a 95% confidence interval which excludes unity will be considered proof that there is a real difference, which is no better than using a p-value threshold. The differing confidence intervals could be used to give an overall estimate of an effect, and its potential range.
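To show what I mean by multiple intervals, here is a rough sketch using the same invented 2×2 table as above (the numbers are made up purely for illustration), reporting 90%, 95% and 99% confidence intervals for a risk ratio:

```python
# A rough sketch (hypothetical 2x2 table, invented numbers): report the
# risk ratio with 90%, 95% and 99% confidence intervals, rather than a
# single significant / not-significant verdict.
import math
from scipy.stats import norm

a, b = 30, 70   # intervention: events, non-events
c, d = 18, 82   # control: events, non-events

rr = (a / (a + b)) / (c / (c + d))
se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))

for level in (0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - level) / 2)
    lo = math.exp(math.log(rr) - z * se_log_rr)
    hi = math.exp(math.log(rr) + z * se_log_rr)
    print(f"RR = {rr:.2f}, {level:.0%} CI {lo:.2f} to {hi:.2f}")
```

With these invented counts the 90% interval excludes unity, the 95% interval just includes it, and the 99% interval comfortably includes it; presented together they describe a range of plausible effects rather than delivering a single in-or-out verdict.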

In this blog I probably sometimes get caught up in the usual patterns of referring to p-values, but mostly I try to say something like “not likely to be due to chance alone”, which does not mean that a difference is necessarily due to a real effect of the intervention, but that the data would be unlikely if you picked the numbers at random out of a soup of numbers. All sorts of things might cause a p-value to be less than 0.05 when you compare outcomes between 2 groups receiving different interventions, and only a minority of them are a true impact of the intervention.
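The “soup of numbers” point is easy to demonstrate by simulation. Here is a sketch, under assumptions I have simply made up (normally distributed noise, 20 patients per group, 1000 repeated “trials”), comparing two groups drawn from exactly the same distribution and counting how often a t-test declares the difference “significant” anyway:

```python
# A sketch of the "soup of numbers" idea: draw two groups from exactly
# the same distribution (no real effect) many times, and count how often
# a t-test returns p < 0.05 anyway.  Assumptions: normal noise, 20 per
# group, 1000 simulated "trials".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_trials, n_per_group = 1000, 20

false_positives = 0
for _ in range(n_trials):
    group_a = rng.normal(0, 1, n_per_group)
    group_b = rng.normal(0, 1, n_per_group)   # same distribution: no true effect
    if ttest_ind(group_a, group_b).pvalue < 0.05:
        false_positives += 1

print(f"'Significant' results with no real effect: {false_positives}/{n_trials}")
# Expect roughly 5% of comparisons -- about 50 of the 1000 -- by chance alone.
```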

One recent paper that I liked was by Doug Altman and a group of co-workers (Greenland S, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337-50). They list the many errors that people make when talking about statistical test results; when I read the list it reminds me of the many similar errors I have read, and probably made myself.

A study with an unknown bias might well provide a “significant” p-value when there is no real effect of the intervention, just as a study with a “non-significant” p-value might report a major advance in medicine.

The authors of that recent paper put it this way:

It is true that the smaller the P value, the more unusual the data would be if every single assumption were correct; but a very small P value does not tell us which assumption is incorrect. For example, the P value may be very small because the targeted hypothesis is false; but it may instead (or in addition) be very small because the study protocols were violated, or because it was selected for presentation based on its small size. Conversely, a large P value indicates only that the data are not unusual under the model, but does not imply that the model or any aspect of it (such as the targeted hypothesis) is correct; it may instead (or in addition) be large because (again) the study protocols were violated, or because it was selected for presentation based on its large size.
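That last point, results being selected for presentation, can also be shown by simulation. In this sketch (again with invented assumptions: 10 null comparisons per “research program”, 20 patients per group, normal noise) only the smallest p-value from each program gets written up:

```python
# A sketch of the "selected for presentation" problem: simulate a group
# of 10 null comparisons (no real effects anywhere) and report only the
# smallest p-value, as selective reporting tends to do.  Assumptions:
# normal noise, 20 patients per group, 1000 simulated "research programs".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_programs, n_comparisons, n_per_group = 1000, 10, 20

reported_significant = 0
for _ in range(n_programs):
    p_values = [
        ttest_ind(rng.normal(0, 1, n_per_group),
                  rng.normal(0, 1, n_per_group)).pvalue
        for _ in range(n_comparisons)
    ]
    if min(p_values) < 0.05:     # only the "best" result gets written up
        reported_significant += 1

print(f"Programs reporting p < 0.05: {reported_significant}/{n_programs}")
# With 10 null comparisons each, roughly 40% of programs can report a
# "significant" finding even though nothing real is going on.
```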

There have been recent publications suggesting that the critical P-value should be shifted to a much smaller number (such as p<0.005), particularly for epidemiological, rather than interventional, studies. But I think that will just shift the problem, and will make it harder to find really useful beneficial effects, or to detect potentially harmful ones.
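One way to see why a stricter threshold makes real benefits harder to find is to look at sample size. A back-of-the-envelope sketch, using the normal approximation and inputs I have simply assumed (two-sided test, 80% power, a standardized effect size of 0.3), compares the usual alpha of 0.05 with the proposed 0.005:

```python
# Back-of-the-envelope sketch (normal approximation): sample size per
# group for a two-sided two-sample comparison at 80% power, for a
# hypothetical standardized effect size of 0.3, under the usual
# alpha = 0.05 and the stricter alpha = 0.005.
from math import ceil
from scipy.stats import norm

effect_size = 0.3   # assumed standardized difference (made up for illustration)
power = 0.80

for alpha in (0.05, 0.005):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n_per_group = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    print(f"alpha = {alpha}: about {ceil(n_per_group)} per group")
```

With these assumed inputs the requirement rises from roughly 175 to roughly 300 patients per group, just to keep the same power for the same effect.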

Abandoning the term “statistically significant” should be enforced; it would force us to make more nuanced and reasonable evaluations of our data.

