P, Damned P and Statistics
Poor power, false positives and false negatives crucify the credibility of P<0.05
I’ve been aware of David Colquhoun’s work for almost twenty years now – initially as an undergraduate, in lectures describing the UCL professor’s pioneering work on single ion channel behavior. More recently, I’ve followed his Twitter feed, @david_colquhoun, and enjoyed reading his website, DC’s Improbable Science (www.dcscience.net), particularly for his candid and often excoriating views on metrics, university management, and alternative medicine.
One of the risks of being at the academic coalface for the best part of forty years is developing a comprehensive understanding of statistics. DC definitely has that, and one of his tweets last month led me to his latest manuscript on arxiv.org; the first line of the abstract states, “If you use P=0.05 to suggest that you have made a discovery, you’ll be wrong at least 30 percent of the time.” The next line raises the stakes further: “If, as is often the case, experiments are underpowered, you’ll be wrong most of the time.”
Rather than try to recapitulate Colquhoun’s workings in the word-count-constrained confines of the Editorial page, I’d suggest you read his manuscript and the examples within it (1). The top-line message: underpowered experiments are dangerous – false positives and false negatives accumulate to give wincingly high false discovery rates, because every real effect you miss means one fewer true positive to dilute the false ones. The rate only climbs as (statistical) power decreases. His advice is “if you wish to keep your false discovery rate below 5 percent, you need to use a 3-sigma rule, or to insist on a P-value below 0.001,” concluding with “And *never* use the word ‘significant’.”
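To see why a lone P just under 0.05 is such weak evidence, here is a minimal sketch of the false discovery arithmetic – not Colquhoun’s own simulations. The 10 percent prevalence of real effects and the power values are illustrative assumptions, chosen so the output echoes the quoted “at least 30 percent” claim:

```python
# A minimal sketch of the false discovery arithmetic (illustrative, not
# a reproduction of Colquhoun's simulations). Assumptions: 10% of tested
# hypotheses are real effects, and a "discovery" is declared when P < alpha.

def false_discovery_rate(prevalence: float, power: float, alpha: float) -> float:
    """Fraction of declared 'discoveries' that are false positives."""
    true_positives = prevalence * power           # real effects we detect
    false_positives = (1 - prevalence) * alpha    # null effects passing P < alpha
    return false_positives / (true_positives + false_positives)

# Decently powered study: roughly a third of "discoveries" are false.
print(f"power 0.8, alpha 0.05:  {false_discovery_rate(0.10, 0.8, 0.05):.0%}")   # 36%

# Underpowered study: most "discoveries" are false.
print(f"power 0.2, alpha 0.05:  {false_discovery_rate(0.10, 0.2, 0.05):.0%}")   # 69%

# Colquhoun's remedy of insisting on P < 0.001 tames the rate.
print(f"power 0.8, alpha 0.001: {false_discovery_rate(0.10, 0.8, 0.001):.0%}")  # 1%
```

Nothing here is fancy – it is essentially Bayes’ theorem in disguise – but it makes plain how quickly false positives come to dominate the “discoveries” once power drops.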
If you accept Colquhoun’s argument, lots of things start to make sense: the irreproducible experimental results; the disappointment of that promising drug candidate failing at Phase II; trials where homeopathy actually appeared to work – right down to the newspaper stories that “seem to link almost any nutritional supplement with almost any outcome” (2). All of them made it into the record because a peer-reviewed publication appeared to back them up. If you’re not already doing so, perhaps it’s time to view anything that reports a P-value close to 0.05 as merely “worth another look”, and to consider results robust only when the P-value approaches 0.001.
1. D. Colquhoun, “An investigation of the false discovery rate and the misinterpretation of P values”, arxiv.org, August 11, 2014.
2. J.P.A. Ioannidis, “Implausible results in human nutrition research”, BMJ, 347, f6698 (2013). doi: 10.1136/bmj.f6698.
I spent seven years as a medical writer, writing primary and review manuscripts, congress presentations and marketing materials for numerous – and mostly German – pharmaceutical companies. Prior to my adventures in medical communications, I was a Wellcome Trust PhD student at the University of Edinburgh.