What Is Statistical Significance: Understanding the P < 0.05 Threshold

Updated June 2026
Statistical significance means that a result at least as extreme as the one observed would be unlikely if the null hypothesis were true. Conventionally, a result is declared statistically significant when the p-value falls below a predetermined threshold (usually 0.05), which is taken as sufficient evidence to reject the null hypothesis. However, statistical significance does not mean the result is large, important, or practically meaningful.

The Mechanics of Significance

Before collecting data, researchers set a significance level (alpha), which defines the maximum probability of a Type I error (rejecting a true null hypothesis) they are willing to accept. The most common choice is alpha = 0.05, meaning a 5% false positive rate is considered acceptable. After collecting data and calculating the test statistic, they compare the resulting p-value to this threshold. If p is less than alpha, the result is statistically significant and the null hypothesis is rejected.
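As a rough illustration of this decision rule, the sketch below runs a two-sided one-sample z-test and compares the p-value to a pre-set alpha. The function name two_sided_z_test and all the numbers are hypothetical, chosen only for illustration, and the test assumes the population standard deviation is known.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(sample_mean, null_mean, sigma, n):
    """Return (z, p) for a two-sided one-sample z-test with known sigma."""
    z = (sample_mean - null_mean) / (sigma / sqrt(n))
    # Probability of a statistic at least this extreme in either direction,
    # assuming the null hypothesis is true.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


ALPHA = 0.05  # significance level, fixed before looking at the data

# Hypothetical study: illustrative numbers, not real data.
z, p = two_sided_z_test(sample_mean=103.0, null_mean=100.0, sigma=15.0, n=100)
significant = p < ALPHA
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {significant}")
```

The key design point is that ALPHA is a constant chosen up front, while p is computed from the data afterwards.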

The 0.05 threshold has no deep mathematical justification. Ronald Fisher suggested it as a convenient rule of thumb in the 1920s, noting that deviations exceeding about two standard errors (corresponding to p = 0.05 for large samples) were worth a second look. Over time, this suggestion calcified into a rigid boundary that determines what gets published, what treatments get approved, and what effects are considered real. Many statisticians argue that this binary thinking causes more harm than good.

One-tailed tests evaluate whether an effect exists in a specific direction (e.g., treatment is better than placebo), while two-tailed tests evaluate whether an effect exists in either direction (treatment differs from placebo, whether better or worse). Two-tailed tests are more conservative and generally preferred unless there is a strong theoretical reason to test only one direction. Switching to a one-tailed test after seeing the data, because the two-tailed test did not reach significance, is a form of p-hacking that inflates the false positive rate.
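The difference between the two kinds of test can be seen with a single test statistic. This is a minimal sketch assuming a standard normal test statistic; the helper names and the value z = 1.8 are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def one_tailed_p(z):
    """P(Z >= z): only an effect in the positive direction counts."""
    return 1 - STANDARD_NORMAL.cdf(z)


def two_tailed_p(z):
    """P(|Z| >= |z|): an effect in either direction counts."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


z = 1.8  # hypothetical test statistic
print(f"one-tailed p = {one_tailed_p(z):.3f}")
print(f"two-tailed p = {two_tailed_p(z):.3f}")
```

For a positive z the one-tailed p-value is exactly half the two-tailed one, so this statistic clears the 0.05 threshold only under the one-tailed test. That is why the choice has to be made before the data are seen.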

Different fields use different significance thresholds based on their tolerance for false positives and the costs of different types of errors. Particle physics requires five-sigma significance (roughly p = 0.0000003) before claiming a discovery, reflecting the enormous cost of announcing a new particle that turns out not to exist. Clinical trials for drug approval typically use a threshold of 0.05 but require replication across multiple studies. Genome-wide association studies use a threshold of 0.00000005 to account for the millions of genetic variants tested simultaneously. These varying standards illustrate that the appropriate alpha depends on the context, not on a universal convention.
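The thresholds quoted above can be reproduced with a few lines of arithmetic. This sketch assumes a one-sided tail of the standard normal for the sigma convention, and it treats the genome-wide threshold as a Bonferroni-style division of 0.05 by one million independent tests. That figure is an assumption made here for illustration, not something stated above.

```python
from math import erfc, sqrt


def upper_tail_p(n_sigma):
    """One-sided p-value for a deviation of n_sigma standard deviations."""
    # erfc stays accurate far out in the tail, unlike 1 - cdf.
    return 0.5 * erfc(n_sigma / sqrt(2))


five_sigma_p = upper_tail_p(5.0)

# Assumed: roughly one million independent tests across the genome.
gwas_threshold = 0.05 / 1_000_000

print(f"five sigma -> p = {five_sigma_p:.2e}")
print(f"GWAS threshold = {gwas_threshold:.0e}")
```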

Statistical Significance vs Practical Significance

The most important distinction in applied statistics is between statistical significance and practical significance (also called clinical significance in medicine). Statistical significance tells you whether the data are hard to reconcile with no effect at all, however small the actual effect may be. Practical significance asks whether the effect is large enough to matter in the real world. These two concepts are distinct: a result can be statistically significant but practically meaningless, or practically important but statistically non-significant.

With large enough samples, trivially small effects become statistically significant. A study of 500,000 people might find that coffee drinkers live 0.3 days longer on average, a statistically significant result (p less than 0.001) that no one would consider meaningful for health decisions. Conversely, a small pilot study might observe a large, clinically important treatment effect that fails to reach significance because the sample was too small to provide adequate power to detect it.
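The dependence of the p-value on sample size can be sketched numerically. The code below uses a large-sample z approximation for comparing two equal-sized groups with equal variances. The function name two_sample_p and the effect sizes are hypothetical, and with very small groups a t-test would give somewhat larger p-values than this approximation.

```python
from math import erfc, sqrt


def two_sample_p(d, n_per_group):
    """Two-sided p-value for a standardized mean difference d.

    Large-sample z approximation, assuming equal group sizes and variances.
    """
    z = d * sqrt(n_per_group / 2)
    return erfc(abs(z) / sqrt(2))  # equals 2 * P(Z >= |z|)


# The same trivially small effect at increasing sample sizes.
for n in (100, 10_000, 250_000):
    print(f"d = 0.02, n per group = {n:>7}: p = {two_sample_p(0.02, n):.3g}")

# A large effect in a small pilot study.
print(f"d = 0.80, n per group = 10: p = {two_sample_p(0.80, 10):.3g}")
```

The effect size is held fixed in the loop, so the only thing driving the p-value below any threshold is the number of observations.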

Effect sizes bridge this gap by measuring the magnitude of a result independently of sample size. Cohen's d, correlation coefficients, odds ratios, and other effect size measures tell you how large the difference or relationship is, while p-values tell you only how unlikely the data would be under the null hypothesis. Both pieces of information are needed for a complete interpretation. A study reporting p = 0.001 and d = 0.05 tells a very different story from one reporting p = 0.04 and d = 0.80.
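As one concrete effect size, Cohen's d is the difference between two group means divided by their pooled standard deviation. The following is a minimal sketch using only the standard library; the data are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical measurements, not real data.
treatment = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```

Unlike a p-value, this quantity does not shrink or grow systematically as more observations are added; it only becomes more precisely estimated.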

The Multiple Testing Problem

When researchers perform multiple statistical tests on the same dataset, the probability of finding at least one significant result by chance alone increases rapidly. Testing 20 independent hypotheses at alpha = 0.05 gives a 64% chance of at least one false positive, even if every null hypothesis is true and no real effects exist. This is the multiple comparisons problem, and it inflates false positive rates far beyond the nominal 5% that each individual test controls.
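The 64% figure follows from the complement rule: if each of m independent tests has probability 1 - alpha of avoiding a false positive, the chance that at least one produces a false positive is 1 - (1 - alpha)^m. A short sketch, with a hypothetical function name:

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests,
    assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** m


for m in (1, 5, 20, 100):
    print(f"{m:>3} tests: {familywise_error_rate(0.05, m):.1%}")
```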

Several corrections address this problem. The Bonferroni correction divides the alpha level by the number of tests (testing 20 hypotheses at alpha = 0.05/20 = 0.0025 each). This is simple but overly conservative, reducing power substantially when many tests are performed. The Holm-Bonferroni method is a step-down procedure that is uniformly more powerful than Bonferroni while still controlling the family-wise error rate. The Benjamini-Hochberg procedure controls the false discovery rate (the expected proportion of rejected hypotheses that are false positives) rather than the family-wise error rate, offering better power when many tests are conducted and some false positives are tolerable.
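The three procedures can be sketched in a few lines each. This is an illustrative implementation written for readability, not a substitute for a vetted statistics library. The function names and the example p-values are hypothetical, and the Benjamini-Hochberg version shown assumes independent (or positively dependent) tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m. Controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # the first failure stops all further rejections
    return reject


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0  # largest 1-based rank k with p_(k) <= (k / m) * q
    for k, i in enumerate(order, start=1):
        if p_values[i] <= k / m * q:
            k_max = k
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


p_values = [0.001, 0.012, 0.014, 0.03, 0.2]  # hypothetical results
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
print("BH:        ", benjamini_hochberg(p_values))
```

On these example values Bonferroni rejects one hypothesis, Holm three, and Benjamini-Hochberg four, which matches the ordering of power described above.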

Pre-registration, where researchers specify their hypotheses and analyses before seeing the data, provides an alternative approach to the multiple testing problem. By distinguishing confirmatory tests (planned in advance) from exploratory analyses (discovered while examining data), pre-registration allows readers to calibrate how much weight to place on each finding. Exploratory findings are treated as hypothesis-generating rather than hypothesis-confirming, requiring independent replication before being considered established.

The Replication Crisis

Over-reliance on the 0.05 threshold has contributed to a replication crisis across multiple scientific disciplines. Large-scale replication projects have reported that, depending on the field and the project, roughly 50-75% of the published findings they examined in psychology and biomedical research failed to replicate when independent teams attempted them. Several factors contribute to this failure rate: publication bias (only significant results get published), p-hacking (analyzing data many ways until significance is found), small sample sizes (which produce noisy, unreliable estimates), and the base rate problem (when most tested hypotheses are false, even a 5% false positive rate generates many false positives relative to true positives).

The base rate issue deserves special attention. If only 10% of tested hypotheses are true, and studies have 80% power at alpha = 0.05, then out of 1000 tests: 80 true effects are detected, 45 false positives are generated, and only 64% of significant results actually reflect real effects. This positive predictive value drops further with lower power or lower base rates, which are common in exploratory research.
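The arithmetic in this example generalizes to any base rate, power, and alpha. A small sketch (the function name is hypothetical) that reproduces the 64% figure and shows how it falls with lower power:

```python
def positive_predictive_value(prior_true, power, alpha):
    """Share of significant results that reflect real effects."""
    true_positives = prior_true * power          # real effects detected
    false_positives = (1 - prior_true) * alpha   # nulls wrongly rejected
    return true_positives / (true_positives + false_positives)


# The scenario from the text: 10% true hypotheses, 80% power, alpha = 0.05.
print(f"{positive_predictive_value(0.10, 0.80, 0.05):.0%}")

# Same base rate with an underpowered study (20% power).
print(f"{positive_predictive_value(0.10, 0.20, 0.05):.0%}")
```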

Reform efforts include pre-registration (specifying hypotheses and analyses before seeing data), registered reports (peer review before data collection), larger sample sizes, reporting of exact p-values and effect sizes, and supplementing or replacing significance testing with Bayesian methods or estimation-based approaches focused on confidence intervals.

Beyond Significance: A Better Approach

Modern best practices treat statistical significance as one piece of evidence, not the final word. A complete reporting of results should include: the test statistic and exact p-value, the effect size with a confidence interval, a statement about whether the effect is large enough to matter practically, consideration of study design quality and potential biases, and acknowledgment of where the results fit within the broader literature. This comprehensive approach prevents the common mistake of reducing complex findings to a single binary decision.
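One way to put the quantitative parts of this checklist into practice is to compute the estimate, its confidence interval, and the p-value together, so that none is reported in isolation. The sketch below does this for a difference between two group means using a normal approximation. The function name and data are hypothetical, and for small samples like the ones shown a t-based interval would be the more appropriate choice.

```python
from math import erfc, sqrt
from statistics import NormalDist, mean, variance


def summarize_difference(a, b, confidence=0.95):
    """Mean difference with test statistic, p-value, and confidence interval.

    Uses a normal approximation with unpooled standard errors.
    """
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = diff / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "difference": diff,
        "z": z,
        "p_value": p_value,
        "ci": (diff - z_crit * se, diff + z_crit * se),
    }


# Hypothetical measurements, not real data.
report = summarize_difference([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
for key, value in report.items():
    print(key, value)
```

Whether the interval excludes practically negligible values is a judgment about the subject matter, which is why the remaining checklist items cannot be automated.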

The American Statistical Association's 2016 statement on p-values emphasized six principles: p-values can indicate how incompatible the data are with a specified statistical model, they do not measure the probability that a hypothesis is true, scientific conclusions should not be based solely on whether a p-value passes a threshold, proper inference requires full reporting and transparency, p-values do not measure the size or importance of an effect, and a p-value by itself does not provide a good measure of evidence regarding a model or hypothesis. The statement concludes that no single index should substitute for scientific reasoning. These principles argue for treating statistical analysis as a tool within scientific judgment rather than a mechanical decision procedure.

Some researchers advocate replacing significance testing entirely with estimation approaches that focus on effect sizes and confidence intervals, or with Bayesian methods that quantify evidence for and against hypotheses on a continuous scale. Others advocate retaining significance testing but with reforms: using alpha = 0.005 instead of 0.05 for claims of new discoveries (as proposed by Benjamin et al. in 2018), requiring replication before findings are considered established, and always reporting effect sizes alongside p-values. Regardless of which approach prevails, the era of treating p less than 0.05 as the sole criterion for scientific truth is ending.

Key Takeaway

Statistical significance (p less than 0.05) means the data is unlikely under the null hypothesis, but it says nothing about whether the effect is large or important. Always report effect sizes alongside significance tests, account for multiple comparisons, and remember that the 0.05 threshold is a convention, not a law of nature. The best statistical practice treats significance as one input to scientific judgment rather than the final verdict.