Common Statistical Errors: Mistakes That Invalidate Research Findings

Updated June 2026
The most common statistical errors include p-hacking (manipulating analyses to achieve significance), confusing correlation with causation, ignoring multiple comparisons corrections, interpreting non-significant results as proof of no effect, survivorship bias, and conflating statistical significance with practical importance. These errors inflate false discovery rates and produce unreliable findings that fail to replicate.

P-Hacking and Data Dredging

P-hacking refers to the practice of trying multiple analyses, outcome measures, subgroup comparisons, or data exclusion criteria until a statistically significant result emerges. Researchers might run 20 different statistical tests on their data and report only the one that produced p < 0.05, without disclosing the 19 failed attempts. Since each test carries a 5% false positive rate, testing 20 hypotheses yields an expected one false positive even when no true effects exist.

Common forms include: removing outliers selectively (only when it improves the p-value), testing multiple outcome variables and reporting only the significant one, adding or removing covariates until results become significant, stopping data collection as soon as significance is reached, and analyzing subgroups until a significant comparison appears. Pre-registration of hypotheses and analysis plans before data collection is the primary defense against p-hacking.
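
To see why stopping data collection as soon as significance is reached is a problem, here is a minimal simulation sketch. It assumes a simple z-test on normally distributed data with a known standard deviation of 1, and the function name, number of simulations, and peeking schedule are all illustrative choices rather than anything prescribed above.

```python
import random
from statistics import NormalDist

def false_positive_rate(n_sims=2000, max_n=100, peek_every=10, alpha=0.05, seed=0):
    """Share of null experiments declared significant at any interim look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # true mean is 0, so the null is true
            if n % peek_every == 0:
                z = total / n ** 0.5  # z-statistic for the mean, known sd = 1
                if abs(z) > z_crit:
                    hits += 1  # stop at the first significant look
                    break
    return hits / n_sims

# Peek after every 10 observations versus a single test at n = 100.
rate_with_peeking = false_positive_rate(peek_every=10)
rate_fixed_n = false_positive_rate(peek_every=100)
print(f"peeking: {rate_with_peeking:.3f}, fixed n: {rate_fixed_n:.3f}")
```

Under these assumptions the fixed-sample test should stay near the nominal 5% rate, while repeated peeking pushes the false positive rate well above it.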

How does p-hacking differ from legitimate exploratory analysis?
Exploratory analysis is honest about its nature, explicitly labeling findings as hypothesis-generating rather than hypothesis-confirming. P-hacking disguises exploratory results as confirmatory by presenting the final significant result as if it were the only analysis conducted. The solution is transparency: pre-register confirmatory hypotheses and clearly separate them from exploratory findings that require independent replication.

Multiple Comparisons Without Correction

When you test multiple hypotheses simultaneously, the probability of at least one false positive grows rapidly. With 20 independent tests at alpha = 0.05, the probability of at least one false positive is 1 - (0.95)^20 ≈ 64%. This family-wise error rate can be controlled with corrections such as Bonferroni (divide alpha by the number of tests) or the Holm step-down procedure. A less conservative alternative is to control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses, using the Benjamini-Hochberg method.
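
The calculation and the three corrections can be sketched in a few lines of plain Python. This is an illustrative implementation, not a replacement for a tested statistics library, and the p-values in the example are made up for demonstration.

```python
def familywise_error_rate(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni(pvals, alpha=0.05):
    """Reject only p-values at or below alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down: compare sorted p-values to alpha/m, alpha/(m-1), ... and stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

def benjamini_hochberg(pvals, q=0.05):
    """Step-up FDR control: reject the k smallest p-values, k = largest rank with p <= rank/m * q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values from six tests (illustrative only).
pvals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.20]
print(round(familywise_error_rate(20), 3))  # 0.642
print(sum(bonferroni(pvals)), sum(holm(pvals)), sum(benjamini_hochberg(pvals)))  # 2 3 5
```

On these example values Bonferroni rejects the fewest hypotheses and Benjamini-Hochberg the most, which reflects the different error rates the procedures control.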

Genomics studies testing thousands of genes simultaneously would produce hundreds of false discoveries without these corrections. Even in smaller-scale research, running separate t-tests for every pair of groups in a multi-group study inflates false positive rates. Using ANOVA with post-hoc tests that incorporate multiple comparisons corrections (Tukey HSD, Bonferroni) properly controls the error rate.

Confusing Correlation with Causation

Observational studies on their own establish associations, not causation, regardless of the strength of the correlation or the sophistication of the statistical model. No amount of regression adjustment can rule out unmeasured confounders that might explain an observed relationship. Randomized experiments with appropriate controls provide the most direct basis for causal claims.

Yet researchers routinely use causal language ("X increases Y," "A leads to B") when reporting observational results, misleading readers about the strength of evidence. Media headlines amplify this problem by converting associations into causal claims. "Coffee drinkers live longer" sounds like causation but reflects only a correlation that could be explained by dozens of confounding variables related to lifestyle, income, and health behaviors.

Can statistical methods ever establish causation from observational data?
Some quasi-experimental methods (instrumental variables, regression discontinuity, difference-in-differences) provide stronger causal evidence from observational data by exploiting natural variation that approximates randomization. However, these methods require strong assumptions that may or may not hold in practice. True randomized experiments remain the gold standard for causal inference.

Interpreting Non-Significance as No Effect

Failing to reject the null hypothesis does not confirm it. A non-significant result (p > 0.05) can mean the null is true, but it can also mean the study lacked adequate power to detect a real effect. A two-group study with 20 participants per group has only about 34% power to detect a medium effect (d = 0.5) at alpha = 0.05, meaning it will miss such an effect roughly two-thirds of the time.
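
Power for a two-group comparison can be roughed out with a normal approximation. The sketch below assumes equal group sizes and a two-sided test at alpha = 0.05, and the function name is illustrative. Because it ignores the extra uncertainty from estimating the variance, it slightly overstates power at small sample sizes compared with an exact noncentral t calculation.

```python
from statistics import NormalDist

def approx_power_two_sample(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test for effect size d."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5  # expected value of the test statistic
    # Probability the statistic lands beyond either critical value.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

for n in (15, 20, 64):
    print(n, round(approx_power_two_sample(0.5, n), 2))
```

Small groups leave power far below the conventional 80% target for a medium effect; under this approximation it takes roughly 64 participants per group to reach it.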

The confidence interval clarifies the distinction: if it is narrow and centered near zero, evidence supports no meaningful effect. If it is wide and includes both zero and substantial effects, the study was simply too imprecise to draw conclusions. Equivalence testing provides a formal framework for concluding that an effect is negligibly small, by testing whether the effect falls within a pre-specified range of practical equivalence.
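
One common form of equivalence testing is the two one-sided tests (TOST) procedure. The sketch below is a simplified version that assumes a normally distributed estimate with a known standard error. The function name, the equivalence bounds, and the example numbers are all hypothetical.

```python
from statistics import NormalDist

def tost_equivalence(estimate, std_error, low, high, alpha=0.05):
    """Two one-sided z-tests: is the effect credibly inside (low, high)?"""
    nd = NormalDist()
    # H0: true effect <= low. A small p-value means the effect is above the lower bound.
    p_lower = 1 - nd.cdf((estimate - low) / std_error)
    # H0: true effect >= high. A small p-value means the effect is below the upper bound.
    p_upper = nd.cdf((estimate - high) / std_error)
    p_value = max(p_lower, p_upper)  # both one-sided tests must reject
    return p_value, p_value < alpha

# Precise estimate near zero: equivalence within +/-0.2 is supported.
print(tost_equivalence(0.02, 0.05, -0.2, 0.2))
# Same estimate, but imprecise: the study cannot rule out meaningful effects.
print(tost_equivalence(0.02, 0.30, -0.2, 0.2))
```

The two calls use the same point estimate and differ only in precision, which is the distinction the confidence interval makes visible.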

What is the difference between "no evidence of effect" and "evidence of no effect"?
No evidence of effect means your study failed to detect anything, possibly due to low power. Evidence of no effect means you had adequate power and the confidence interval excludes all practically meaningful effect sizes. The first is uninformative; the second is a genuine scientific finding. Equivalence tests and confidence intervals distinguish these situations.

Confusing Statistical and Practical Significance

A result can be statistically significant (p < 0.05) while being practically meaningless. With 500,000 observations, a correlation of r = 0.01 is statistically significant but explains only 0.01% of the variance. Conversely, a clinically important treatment effect may not reach statistical significance in a small, underpowered study. Always report effect sizes alongside p-values to distinguish real importance from mere detectability.
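
The large-sample correlation example can be checked with a short sketch. It uses the Fisher z-transformation as an approximate significance test, and the function name is illustrative.

```python
import math
from statistics import NormalDist

def correlation_z_test(r, n):
    """Approximate two-sided test of rho = 0 using the Fisher z-transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)  # atanh(r) has standard error 1/sqrt(n - 3)
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

r, n = 0.01, 500_000
z_stat, p_value = correlation_z_test(r, n)
variance_explained = r ** 2  # proportion of variance shared by the two variables

print(f"z = {z_stat:.2f}, p = {p_value:.1e}, variance explained = {variance_explained:.2%}")
```

The test statistic is far beyond any conventional threshold, yet the shared variance is a hundredth of a percent, so the p-value says nothing about whether the relationship matters.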

The solution is to evaluate both statistical and practical significance together. A study reporting p = 0.001 and d = 0.05 has found a real but trivial effect. A study reporting p = 0.08 and d = 0.75 may have found an important effect that the sample was too small to confirm. Confidence intervals communicate both the estimated magnitude and the precision of the estimate, providing a more complete picture than either the p-value or effect size alone.

Simpson's Paradox

Simpson's paradox occurs when a trend that appears in subgroups reverses when the subgroups are combined. A treatment might appear to be more effective than a placebo in both mild and severe cases separately, yet appear less effective overall because sicker patients disproportionately received the treatment. This paradox arises from confounding variables that affect both the grouping and the outcome, and it demonstrates why controlling for relevant variables is essential in regression and other analyses.
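
A small numeric sketch makes the reversal concrete. The recovery counts below are invented for illustration and are not from any real trial.

```python
# Hypothetical (recovered, total) counts by severity and arm.
counts = {
    "mild":   {"treatment": (45, 50),  "placebo": (160, 200)},
    "severe": {"treatment": (80, 200), "placebo": (15, 50)},
}

def stratum_rate(severity, arm):
    recovered, total = counts[severity][arm]
    return recovered / total

def pooled_rate(arm):
    """Recovery rate after combining mild and severe cases."""
    recovered = sum(counts[s][arm][0] for s in counts)
    total = sum(counts[s][arm][1] for s in counts)
    return recovered / total

for severity in counts:
    print(severity, stratum_rate(severity, "treatment"), stratum_rate(severity, "placebo"))
print("pooled", pooled_rate("treatment"), pooled_rate("placebo"))
# mild 0.9 0.8
# severe 0.4 0.3
# pooled 0.5 0.7
```

In this made-up data the treatment wins in both strata but loses in the pooled comparison, because 200 of the 250 treated patients were severe cases while 200 of the 250 placebo patients were mild.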

The famous Berkeley gender bias case illustrates this. Overall admission rates appeared to favor men, but examining individual departments revealed that women were admitted at equal or higher rates in most departments. Women disproportionately applied to more competitive departments with lower admission rates, creating the appearance of an overall bias against women that was largely absent at the department level. The lesson is that aggregate data can produce conclusions that contradict the patterns in disaggregated data.

Survivorship Bias

Survivorship bias occurs when analyses include only observations that "survived" some selection process while ignoring those that did not. Analyzing only successful companies to identify success factors ignores all the failed companies that did the same things. Studying only patients who completed a drug trial ignores those who dropped out due to side effects. The survivors are a biased sample that overestimates positive outcomes and underestimates risks.
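
The effect is easy to reproduce in a toy simulation. The sketch below assumes hypothetical investment funds whose annual returns are drawn from a normal distribution with a true mean of zero, where any fund returning less than -10% closes and disappears from the data. All names and parameters are illustrative.

```python
import random
from statistics import mean

def simulate_survivorship(n_funds=10_000, cutoff=-0.10, seed=1):
    """Compare the average return of all funds with that of surviving funds only."""
    rng = random.Random(seed)
    returns = [rng.gauss(0.0, 0.20) for _ in range(n_funds)]  # true average return is 0
    survivors = [r for r in returns if r > cutoff]  # funds below the cutoff vanish
    return mean(returns), mean(survivors), len(survivors)

all_mean, survivor_mean, n_survivors = simulate_survivorship()
print(f"all funds: {all_mean:+.3f}, survivors only: {survivor_mean:+.3f} ({n_survivors} funds)")
```

Although the funds have no real edge by construction, the surviving funds show a clearly positive average return, which is the kind of overestimate the survivors-only sample produces.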

The classic example is the World War II bomber analysis, where engineers initially proposed reinforcing the areas of returning planes that showed the most damage. The statistician Abraham Wald recognized that the returning planes represented survivors, and that the areas without damage were actually the most critical because planes hit there never returned. This insight illustrates how focusing only on survivors can lead to conclusions that are the opposite of the correct ones.

Ecological Fallacy and Base Rate Neglect

The ecological fallacy draws conclusions about individuals from group-level data. Countries with higher chocolate consumption have more Nobel laureates, but this does not mean that individuals who eat more chocolate are more likely to win Nobel Prizes. The correlation exists at the country level because wealthy nations have both luxury food consumption and research funding. Inferring individual-level relationships from aggregate data is statistically invalid because group averages hide within-group variation.

Base rate neglect occurs when people ignore the prevalence (base rate) of a condition when interpreting test results. A medical test with 99% sensitivity and 95% specificity sounds highly accurate, but when the disease affects only 1 in 1,000 people, a positive result still means only about a 2% chance of actually having the disease. The large number of healthy people producing false positives overwhelms the small number of sick people producing true positives. Bayes' theorem handles this calculation correctly by incorporating the prior probability of the condition.
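
The calculation takes only a few lines. The function name below is illustrative, and the inputs are the figures from the example above.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    true_positives = sensitivity * prevalence                # sick and test positive
    false_positives = (1 - specificity) * (1 - prevalence)   # healthy but test positive
    return true_positives / (true_positives + false_positives)

ppv = positive_predictive_value(sensitivity=0.99, specificity=0.95, prevalence=0.001)
print(f"{ppv:.1%}")  # 1.9%

# Per 100,000 people tested: about 99 true positives against about 4,995 false positives.
```

Rerunning the function with a higher prevalence, such as 0.10, shows how strongly the answer depends on the base rate rather than on the test's accuracy alone.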

Key Takeaway

Common statistical errors arise from p-hacking, ignoring multiple comparisons, confusing correlation with causation, treating non-significance as proof of no effect, survivorship bias, and base rate neglect. Pre-registration, correction for multiple testing, reporting confidence intervals and effect sizes, and Bayesian reasoning protect against these mistakes. Awareness of these errors is the first step toward avoiding them in your own research.