Problems with p

Holmes and Peirce

(these slides available from www.peirce.org.uk/talks/p-hack )



  • Perceived problems
  • Proper problems
  • p-hacking: Popular topics
  • p-hacking: exPerimenter degrees of freedom
  • p-hacking: Potential solutions

This is hopefully going to be a discussion. Nick and I don’t necessarily have the answers!


Perceived versus Proper Problems


Parametric versus non-parametric stats

  • Will it make any difference?
  • Is your p-value so close to 0.05 that the two kinds of test give different answers?
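To make the question concrete, here is a minimal simulation sketch (assuming NumPy and SciPy are available; the skewed data are invented for illustration) that runs a parametric and a non-parametric test on the same two samples and prints both p-values:

```python
# Sketch: run a parametric and a non-parametric test on the same data.
# The exponential samples are illustrative, not from any real study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.exponential(scale=1.0, size=20)    # skewed group
b = rng.exponential(scale=1.6, size=20)    # skewed group, larger scale

p_param = stats.ttest_ind(a, b).pvalue     # parametric: independent-samples t-test
p_nonpar = stats.mannwhitneyu(a, b).pvalue # non-parametric: rank-based

print(f"t-test p = {p_param:.3f}, Mann-Whitney p = {p_nonpar:.3f}")
```

Most of the time the two p-values point the same way; the interesting (and rare) case is when one lands either side of 0.05.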

p is not what you think

  • Many scientists (even well-educated ones) don’t understand p
  • Nick will cover this later
  • Then again, have you ever come across a really good example where it matters?!

Proper Problems

I don’t think either of those problems has led to large-scale errors in reported findings.


Proper Problems

These really have impacted “findings”:
  • Scientists are not impartial
  • Any one dataset can be analysed in more than one way
  • (Many) journals only publish “exciting” results
  • Hypothesis testing was never designed to cope with these issues

Most reported positive results are false alarms (Horton, 2015; Ioannidis, 2005; Harris, 2017)


p is meaningless in exploratory research

  • We know that we should correct for “Family-wise” error
  • What defines a “Family” in this case?
  • Do all studies have the same Family-wise error?
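The arithmetic behind family-wise error is worth seeing once. A small sketch (pure arithmetic, no data needed): the chance of at least one false positive across m independent tests at α = 0.05, and the Bonferroni-corrected per-test threshold that would hold the family-wise rate at α:

```python
# Sketch: family-wise error rate for m independent tests at alpha = 0.05,
# plus the Bonferroni-corrected per-test threshold.
alpha = 0.05
for m in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** m      # P(at least one false positive in m tests)
    bonferroni = alpha / m           # per-test alpha that keeps FWER <= 0.05
    print(f"m = {m:3d}: FWER = {fwer:.3f}, Bonferroni alpha = {bonferroni:.5f}")
```

With 20 tests the family-wise error is already about 0.64 — which is why the definition of the "family" matters so much in exploratory work.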

One ‘is allowed’ to apply statistical tests in exploratory research, just as long as one realizes that they do not have evidential impact. — De Groot (1956), translated by Wagenmakers et al. (2014)

p only makes sense when there is a single way to analyse the data, decided on beforehand.


Places of increased family-wise error


“This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.” Brian Wansink, “The Grad Student Who Never Said ‘No’”

  • Postdocs everywhere, in labs with lots of staff:
    • shall we rerun the study?
    • what if two studies have disagreeing results?
    • then again, are we saying that a study can’t be repeated?
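The "shall we rerun the study?" question has a quantifiable cost. A simulation sketch (the numbers — 3 attempts, n = 30 per group — are invented for illustration): both groups are always drawn from the same distribution, so every significant result is a false alarm, yet rerunning until something "works" pushes the false-positive rate well above the nominal 5%:

```python
# Sketch: false-positive rate when a null study is rerun until it "works".
# H0 is true by construction: both groups come from the same distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n, attempts = 2000, 30, 3
hits = 0
for _ in range(n_sims):
    for _ in range(attempts):            # rerun until p < .05 or attempts run out
        a = rng.normal(size=n)
        b = rng.normal(size=n)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1                    # a "finding" that is guaranteed false
            break
print(f"False-positive rate with reruns: {hits / n_sims:.3f}")
```

With three attempts the rate lands near 1 − 0.95³ ≈ 0.14, nearly triple the nominal α — without anyone consciously cheating.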

Potential solutions

For discussion’s sake (I’m not saying I agree with these)


Playing with Bayes?

Maybe Bayesian statistics get us out of this (e.g. see Wagenmakers)



Prevent p testing (in all forms)


PsychFileDrawer.org uploads

We could upload all our null results to PsychFileDrawer.org

  • That would reduce the imbalance between positive and null results
  • Would it, though? Will people actually read the null findings in PsychFileDrawer?

rePlication studies

We could encourage and enable replications to be conducted

  • make it possible to publish a replication (and a non-replication)
  • need to make sure that failed replications were well-run
  • how do we encourage this pursuit?


e.g. pre-register on Open Science Framework


p reduction

If the core problem is that too many published positives are false alarms, maybe we should reduce alpha from 0.05 to 0.005…?
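The trade-off can be put in numbers with the Ioannidis-style positive predictive value: the share of significant findings that are real. A sketch with assumed (purely illustrative) inputs — 10% of tested hypotheses true, 80% power:

```python
# Sketch: positive predictive value (share of "positive" findings that are
# real) at two alpha levels. The prior and power are illustrative guesses.
prior, power = 0.10, 0.80
for alpha in (0.05, 0.005):
    ppv = (power * prior) / (power * prior + alpha * (1 - prior))
    print(f"alpha = {alpha}: PPV = {ppv:.2f}")
```

Under these assumptions, dropping α to 0.005 raises the PPV from roughly 0.64 to roughly 0.95 — though at the cost of needing larger samples to keep the same power.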