How to Plan Sample Size for Your Experiment

Updated May 2026

Sample size planning determines how many participants, observations, or experimental units you need to detect a meaningful effect with adequate statistical power. Getting this number right before you start collecting data is one of the most important decisions in experimental design: an underpowered study spends resources on a question it cannot reliably answer, while an overpowered study enrolls more participants than the question requires.

Most published studies in the biomedical and social sciences are underpowered. A landmark 2017 meta-analysis found that the median power of studies in these fields was approximately 0.36, meaning the typical study had only about a 36 percent chance of detecting the effect it was designed to find, even if that effect was real. Proper sample size planning prevents this problem by linking the number of observations directly to the statistical question being asked.

Define Your Research Question and Design

Before calculating anything, you need to know exactly what comparison you are making. Are you comparing two groups (treatment versus control) or more than two? Is your design between-subjects, where each participant is in only one condition, or within-subjects, where each participant experiences all conditions? The answer determines which statistical test you will use, and the statistical test determines the sample size formula.

A between-subjects design comparing two groups on a continuous outcome calls for an independent-samples t-test. A design with three or more groups uses ANOVA. A within-subjects design uses a paired t-test (two conditions) or repeated-measures ANOVA (three or more conditions). Each test has different power characteristics, meaning the same effect size requires different sample sizes depending on the design. Within-subjects designs are usually more powerful because each participant serves as their own control, which removes between-person variability from the comparison. For a two-condition study they often need half as many participants as an equivalent between-subjects design, or fewer, and the saving grows as the correlation between a participant's repeated measurements increases.
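How much a within-subjects design saves depends on how strongly each participant's repeated measurements correlate. The sketch below compares the two designs for a two-condition study using the standard normal approximation, not the exact t-based calculation that dedicated software performs. The function names and the correlation value r = 0.5 are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def z_sum(alpha=0.05, power=0.80):
    """z_(1 - alpha/2) + z_(power) for a two-tailed test."""
    norm = NormalDist()
    return norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)


def between_total(d, alpha=0.05, power=0.80):
    """Approximate total N for two independent groups of equal size."""
    per_group = 2 * (z_sum(alpha, power) / d) ** 2
    return 2 * math.ceil(per_group)


def within_total(d, r, alpha=0.05, power=0.80):
    """Approximate N for a paired design whose two measurements correlate at r.

    The effect size on the difference scores is d_z = d / sqrt(2 * (1 - r)).
    """
    d_z = d / math.sqrt(2 * (1 - r))
    return math.ceil((z_sum(alpha, power) / d_z) ** 2)


n_between = between_total(d=0.5)
n_within = within_total(d=0.5, r=0.5)  # r = 0.5 is a hypothetical correlation
print(n_between, n_within)
```

Because this is a normal approximation, exact t-based tools will typically report slightly larger numbers. The comparison also ignores practical costs of within-subjects designs such as carryover between conditions.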

You also need to decide whether your hypothesis is directional (one-tailed) or non-directional (two-tailed). A one-tailed test predicts the direction of the effect (e.g., the treatment group will score higher), while a two-tailed test allows for effects in either direction. One-tailed tests require smaller samples for the same power, but they cannot detect effects in the unexpected direction. Most researchers use two-tailed tests unless there is a strong theoretical reason to predict the direction.
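To see how much the choice of tails matters, the following sketch solves for the per-group sample size of an independent-samples t-test under both alternatives, using the statsmodels power module. The inputs (d = 0.5, alpha = 0.05, power = 0.80) are illustrative values, not recommendations.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Same effect size, alpha, and power; only the alternative hypothesis differs.
n_two_tailed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                    alternative="two-sided")
n_one_tailed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                    alternative="larger")

# Round up, because you cannot recruit a fraction of a participant.
print(math.ceil(n_two_tailed), math.ceil(n_one_tailed))
```

The one-tailed figure is smaller, but it only buys power for an effect in the predicted direction.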

Estimate the Expected Effect Size

The effect size quantifies how large the difference between conditions is expected to be. Larger effects require smaller samples to detect, and smaller effects require larger samples. This is intuitive: a drug that cures 90 percent of patients is easy to distinguish from a placebo in a small trial, while a drug that improves outcomes by 2 percent requires thousands of patients to detect reliably.

Cohen's d is the most common effect size measure for comparing two group means. It expresses the difference between means in standard deviation units. Cohen suggested benchmarks of 0.2 (small), 0.5 (medium), and 0.8 (large), but these are arbitrary conventions. The best effect size estimate comes from pilot studies or published meta-analyses of similar interventions. If no prior data exist, use the smallest effect size that would be practically meaningful: a sample sized for that effect gives you at least the target power for any larger effect as well.
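If you have pilot data, Cohen's d can be computed directly as the difference in means divided by the pooled standard deviation. The sketch below shows one way to do this with the Python standard library; the function name and the pilot scores are invented purely for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical pilot scores, not real data.
treatment = [12, 15, 14, 10, 13, 16]
control = [11, 12, 10, 9, 13, 11]

d = cohens_d(treatment, control)
print(round(d, 2))
```

An estimate from a pilot this small is very noisy, so treat it as one input alongside published effect sizes rather than as the planning value on its own.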

For correlation studies, the effect size is the expected correlation coefficient (r). For chi-squared tests of association, it is Cramér's V or the odds ratio. For ANOVA designs, it is eta-squared or partial eta-squared, which power software often asks you to enter as Cohen's f. Each measure has its own scale and interpretation, so make sure the effect size metric matches the statistical test you plan to use.
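Power tools such as G*Power and the R pwr package take the ANOVA effect size as Cohen's f rather than eta-squared, so a conversion is often needed. A minimal sketch of that conversion follows; the function name and the example value of 0.06 are assumptions for illustration.

```python
import math


def eta_squared_to_f(eta_sq):
    """Convert (partial) eta-squared to Cohen's f: f = sqrt(eta^2 / (1 - eta^2))."""
    if not 0 <= eta_sq < 1:
        raise ValueError("eta-squared must be in [0, 1)")
    return math.sqrt(eta_sq / (1 - eta_sq))


f = eta_squared_to_f(0.06)  # 0.06 is a hypothetical eta-squared
print(round(f, 3))
```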

Choose Your Significance Level and Power

The significance level (alpha) is the probability of concluding that an effect exists when it actually does not, a false positive or Type I error. The conventional threshold is 0.05, meaning you accept a 5 percent risk of a false positive. Some fields use stricter thresholds: particle physics requires alpha of about 0.0000003 (the five-sigma standard), and genome-wide association studies use 0.00000005, which corresponds to a 0.05 threshold corrected for roughly a million independent comparisons.
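Both of these strict thresholds can be reproduced with a few lines of arithmetic. The sketch below computes the one-sided tail probability beyond five standard deviations and a Bonferroni-style threshold for one million tests; the figure of exactly one million tests is an assumed round number.

```python
from statistics import NormalDist

# One-sided tail probability beyond five standard deviations.
alpha_five_sigma = NormalDist().cdf(-5)

# Family-wise 0.05 split across one million tests (assumed round number).
alpha_gwas = 0.05 / 1_000_000

print(f"{alpha_five_sigma:.2e}", f"{alpha_gwas:.0e}")
```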

Statistical power (1 minus beta) is the probability of detecting a real effect when it exists. Convention targets 0.80, but there is nothing magical about this number. For important decisions, such as clinical trials for life-saving drugs, researchers may target 0.90 or 0.95. Higher power requires larger samples, so the choice involves a trade-off between confidence and cost. A power of 0.80 means that if the true effect is exactly as large as estimated, you would expect to detect it in about 80 out of 100 identical experiments and miss it in the other 20.

Run the Power Analysis

With the design, effect size, alpha, and target power specified, you can calculate the required sample size using dedicated software. G*Power is a free, widely used desktop application that handles most common designs. R has the pwr package for basic calculations and the simr package for simulation-based power analysis of complex designs. Python users can use the statsmodels library (its statsmodels.stats.power module). Many online calculators also exist for simple two-group comparisons.

For a two-tailed independent-samples t-test with d = 0.5, alpha = 0.05, and power = 0.80, the required sample size is 64 per group (128 total). For d = 0.3 (a small-to-medium effect), it jumps to 176 per group (352 total). For d = 0.8 (a large effect), it drops to 26 per group (52 total). These numbers make clear why effect size estimation is so critical: the required sample grows roughly with 1/d², so a modest change in the expected effect size can more than double or halve the required sample.
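The per-group figures above can be reproduced with the statsmodels power module. This is a minimal sketch for the two-group, equal-allocation, two-tailed case only; other designs need a different power class.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

per_group = {}
for d in (0.3, 0.5, 0.8):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             ratio=1.0, alternative="two-sided")
    per_group[d] = math.ceil(n)  # round up to whole participants

for d, n in per_group.items():
    print(f"d = {d}: {n} per group, {2 * n} total")
```

Leaving a different argument unset (for example passing nobs1 and omitting power) makes solve_power solve for that quantity instead.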

For more complex designs, such as mixed-model ANOVA, multilevel models, or survival analysis, closed-form power formulas may not exist. In these cases, simulation-based power analysis generates thousands of synthetic datasets with the expected effect size, analyzes each one, and calculates the proportion that produces a significant result. This approach is flexible but computationally intensive and requires careful specification of the data-generating model.
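The simulation idea is easiest to see on a simple design where the analytic answer is known. The sketch below estimates power for a two-group t-test by generating many synthetic experiments with NumPy and SciPy. It assumes normally distributed outcomes with unit variance in both groups; the function name, number of simulations, and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats


def simulated_power(n_per_group, d, alpha=0.05, n_sims=5000, seed=0):
    """Estimate power of a two-tailed independent-samples t-test by simulation.

    Assumes normal outcomes with standard deviation 1 in both groups,
    so the difference in means equals Cohen's d.
    """
    rng = np.random.default_rng(seed)
    control = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    treatment = rng.normal(d, 1.0, size=(n_sims, n_per_group))
    # One t-test per simulated experiment (one row each).
    p_values = stats.ttest_ind(treatment, control, axis=1).pvalue
    # Power is the share of experiments that reach significance.
    return float(np.mean(p_values < alpha))


power = simulated_power(n_per_group=64, d=0.5)
print(f"estimated power: {power:.3f}")
```

For a real complex design you would replace the data-generating lines and the test with your own model and analysis, then repeat over a range of sample sizes until the estimated power reaches the target.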

Adjust for Practical Constraints

The power analysis gives you the ideal sample size, but real experiments face dropout, non-compliance, and missing data. If you expect 15 percent of participants to drop out before completing the study, inflate your initial enrollment by approximately 18 percent (divide the required sample by 0.85) to ensure you end up with enough complete cases for analysis.
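The dropout adjustment is a one-line calculation. A small sketch, with a hypothetical helper name, applied to a study that needs 128 complete cases:

```python
import math


def enrollment_target(required_n, dropout_rate):
    """Participants to enroll so that required_n remain after expected dropout."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return math.ceil(required_n / (1 - dropout_rate))


# 128 complete cases needed, 15 percent expected dropout.
enroll = enrollment_target(128, 0.15)
print(enroll)
```

Note that the adjustment divides by the retention rate rather than adding the dropout percentage, which is why 15 percent dropout calls for roughly 18 percent more enrollment.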

Multiple comparisons also affect sample size requirements. If you plan to compare three groups with pairwise t-tests, you will conduct three comparisons, and the family-wise error rate will exceed 0.05 unless you apply a correction like Bonferroni (dividing alpha by the number of comparisons). A Bonferroni correction for three comparisons uses alpha = 0.0167 per test, which requires a larger sample to maintain the same power. Factor this adjustment into your power analysis from the start.
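To gauge what the correction costs, you can run the power analysis twice, once at the family-wise alpha and once at the Bonferroni-adjusted alpha. The sketch below does this for a single pairwise comparison with statsmodels; d = 0.5 and power = 0.80 are illustrative inputs.

```python
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

n_comparisons = 3
alpha_family = 0.05
alpha_per_test = alpha_family / n_comparisons  # Bonferroni: about 0.0167

# Per-group n for one pairwise t-test, without and with the correction.
n_uncorrected = math.ceil(
    analysis.solve_power(effect_size=0.5, alpha=alpha_family, power=0.80))
n_corrected = math.ceil(
    analysis.solve_power(effect_size=0.5, alpha=alpha_per_test, power=0.80))

print(n_uncorrected, n_corrected)
```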

Budget and logistics set hard upper limits on sample size. If the power analysis says you need 500 participants but your lab can only recruit 200, you have three options: increase the effect size by using a stronger manipulation, reduce measurement noise by using more precise instruments, or switch to a more efficient design (such as within-subjects or ANCOVA with a pre-test covariate). What you should not do is proceed with an underpowered study and hope for the best.
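It helps to quantify what "underpowered" means before deciding. The sketch below assumes a two-group t-test and a hypothetical effect size of d = 0.25, which needs roughly 500 participants in total for 80 percent power, and compares the power at that sample size with the power at 200.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d = 0.25  # hypothetical effect size, chosen so that about 500 total are needed

power_planned = analysis.power(effect_size=d, nobs1=250, alpha=0.05)   # 500 total
power_feasible = analysis.power(effect_size=d, nobs1=100, alpha=0.05)  # 200 total

print(f"{power_planned:.2f} {power_feasible:.2f}")
```

If the feasible power is well below your target, that is the signal to strengthen the manipulation, reduce measurement noise, or change the design rather than run the study as planned.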

Key Takeaway

Always calculate your required sample size before collecting data. The four inputs you need are the statistical test, the expected effect size, the significance level, and the target power. Get these right, and your experiment will be properly sized to answer the question it asks.