ANOVA Explained Simply: Comparing Means Across Multiple Groups

Updated June 2026
Analysis of Variance (ANOVA) is a statistical method that tests whether the means of three or more groups differ significantly. Rather than running multiple t-tests (which inflates the Type I error rate), ANOVA uses a single F-test to determine whether at least one group mean differs from the others. If it does, post-hoc tests identify which specific groups differ.

Why Not Just Use Multiple T-Tests?

If you have three groups (A, B, C), comparing all pairs requires three separate t-tests (A vs B, A vs C, B vs C). At alpha = 0.05, each test carries a 5% chance of a false positive when no real difference exists. Treating the three tests as independent, the family-wise error rate rises to 1 - 0.95^3, or approximately 14%, meaning you have a 14% chance of declaring at least one difference significant when none actually exists. With five groups, you need 10 pairwise comparisons and the error rate exceeds 40%. ANOVA solves this by testing all groups simultaneously in a single procedure that maintains the error rate at your chosen alpha level.
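The arithmetic behind those percentages can be checked with a short Python sketch. It assumes the pairwise tests are independent, which is a simplification (tests that share a group are correlated), and the function name is only illustrative.

```python
from math import comb


def familywise_error_rate(n_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise tests.

    Assumes the tests are independent, which is a simplification.
    """
    n_comparisons = comb(n_groups, 2)  # number of distinct pairs
    return 1 - (1 - alpha) ** n_comparisons


for k in (3, 5):
    # groups, pairwise comparisons, family-wise error rate
    print(k, comb(k, 2), round(familywise_error_rate(k), 3))
# 3 3 0.143
# 5 10 0.401
```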

How ANOVA Works: The F-Statistic

ANOVA compares two sources of variation in your data. Between-group variance measures how much the group means differ from the overall mean (the grand mean). If the treatment has an effect, group means should spread apart, producing large between-group variance. Within-group variance measures how much individual observations scatter around their own group mean. This represents random noise or individual differences unrelated to group membership.

The F-statistic is the ratio of between-group variance to within-group variance:

F = (between-group variance) / (within-group variance)

When there is no real difference between groups, the F-ratio should be close to 1 because both variance sources reflect only random variation. When group means truly differ, the between-group variance exceeds within-group variance, producing an F-ratio substantially greater than 1. The larger the F-value, the stronger the evidence that at least one group differs from the others. An F-value of 1.2 provides little evidence, while an F-value of 8.5 provides strong evidence, though the critical threshold depends on degrees of freedom and the chosen significance level.

The F-statistic follows an F-distribution with two degrees of freedom parameters: df1 (number of groups minus 1) and df2 (total sample size minus number of groups). You compare your calculated F to the critical value from this distribution at your chosen alpha level, or equivalently, compute the p-value associated with your F-statistic.
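As a rough illustration, the F-ratio and its two degrees of freedom can be computed by hand in a few lines of Python. The data below are made-up numbers chosen so the arithmetic is easy to follow, and the function name is hypothetical. The sketch stops at the F-value and does not compute a p-value.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df1, df2) for a list of groups of numeric observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    df1 = len(groups) - 1                # number of groups minus 1
    df2 = len(all_values) - len(groups)  # total sample size minus number of groups
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scatter of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / df1
    ms_within = ss_within / df2
    return ms_between / ms_within, df1, df2


# Made-up data: group means are 5, 7, and 9; the grand mean is 7
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
f_value, df1, df2 = one_way_f(groups)
print(f_value, df1, df2)  # 12.0 2 6
```

Here the between-group mean square is 24 / 2 = 12 and the within-group mean square is 6 / 6 = 1, so F = 12 on 2 and 6 degrees of freedom.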

One-Way ANOVA

One-way ANOVA involves a single categorical independent variable (factor) with three or more levels (groups). Examples include comparing test scores across three teaching methods, comparing blood pressure across four drug dosages, or comparing crop yields across five fertilizer types. The null hypothesis states that all group means are equal: H0: mu1 = mu2 = mu3 = ... = muk. The alternative states that at least one mean differs.

Assumptions of one-way ANOVA include: independence of observations (each person or unit appears in only one group), normality (the dependent variable is approximately normally distributed within each group), and homogeneity of variances (the variance within each group is roughly equal). ANOVA is robust to moderate violations of normality, especially with larger sample sizes, thanks to the central limit theorem. Levene's test checks the equal variance assumption. When variances are unequal, Welch's ANOVA provides a valid alternative.
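To make the equal-variance check less abstract, here is a minimal sketch of the statistic behind Levene's test: a one-way ANOVA F computed on the absolute deviations of each observation from its group mean. The data are made up, the function name is hypothetical, and the sketch computes only the statistic W, not its p-value (W is referred to an F-distribution with the same degrees of freedom as the ANOVA). In practice a statistics library would normally be used; SciPy's `scipy.stats.levene`, for instance, centers on the median by default rather than the mean used here.

```python
from statistics import mean


def levene_w(groups):
    """Levene's W using deviations from each group's mean.

    Equivalent to a one-way ANOVA F on the absolute deviations.
    Raises ZeroDivisionError if the deviations have no within-group spread.
    """
    deviations = [[abs(x - mean(g)) for x in g] for g in groups]
    all_dev = [z for d in deviations for z in d]
    grand = mean(all_dev)
    k, n_total = len(groups), len(all_dev)
    between = sum(len(d) * (mean(d) - grand) ** 2 for d in deviations)
    within = sum((z - mean(d)) ** 2 for d in deviations for z in d)
    return ((n_total - k) / (k - 1)) * between / within


# Made-up data: the middle group is far more spread out than the others
groups = [[4, 5, 6], [2, 7, 12], [8, 9, 10]]
print(round(levene_w(groups), 3))  # 2.37
```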

Post-Hoc Tests

A significant ANOVA result tells you that at least one group differs but does not identify which specific pairs of groups are different. Post-hoc (after the fact) tests make pairwise comparisons while controlling the family-wise error rate.

Tukey's HSD (Honestly Significant Difference) is the most common post-hoc test. It compares all possible pairs of group means and controls the overall Type I error rate at alpha. It works best when group sizes are equal or approximately equal.

Bonferroni correction divides the significance threshold by the number of comparisons. With three groups (three comparisons), each pairwise test uses alpha = 0.05/3 = 0.0167. This approach is conservative (may miss real differences) but simple and applicable to any set of planned comparisons.
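A small sketch of the Bonferroni rule follows. The p-values are invented for illustration and the function name is hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-comparison threshold and which comparisons pass it."""
    threshold = alpha / len(p_values)
    decisions = {pair: p <= threshold for pair, p in p_values.items()}
    return threshold, decisions


# Hypothetical p-values from three pairwise tests
p_values = {"A vs B": 0.004, "A vs C": 0.030, "B vs C": 0.200}
threshold, decisions = bonferroni(p_values)
print(round(threshold, 4))  # 0.0167
print(decisions)            # {'A vs B': True, 'A vs C': False, 'B vs C': False}
```

Note that the A vs C comparison would pass an uncorrected 0.05 threshold but fails the corrected one, which is the conservatism described above.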

Scheffe's method is the most conservative post-hoc test, controlling the error rate for all possible contrasts (not just pairwise comparisons). Use it when you want to examine complex comparisons such as "Does the average of groups A and B differ from group C?"

Two-Way and Factorial ANOVA

Two-way ANOVA examines two factors simultaneously, along with their interaction. For example, a study might examine the effects of both teaching method (lecture, discussion, lab) and class size (small, medium, large) on exam scores. The analysis tests three hypotheses: Does teaching method affect scores (main effect 1)? Does class size affect scores (main effect 2)? Does the effect of teaching method depend on class size (interaction)?

Interactions are often the most interesting finding. An interaction means the effect of one factor is not constant across levels of the other factor. Perhaps the discussion method excels in small classes but performs poorly in large ones, while the lecture method works equally well regardless of class size. Without testing for interactions, you might report a misleading average effect that applies to neither context accurately.
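The idea can be illustrated with a table of cell means. The numbers below are invented to match the scenario just described, and the sketch only computes descriptive contrasts; it is not a significance test.

```python
# Hypothetical mean exam scores for each (method, class size) cell
cell_means = {
    ("discussion", "small"): 85.0,
    ("discussion", "large"): 70.0,
    ("lecture", "small"): 78.0,
    ("lecture", "large"): 78.0,
}


def size_effect(method):
    """Change in mean score when moving from small to large classes."""
    return cell_means[(method, "large")] - cell_means[(method, "small")]


def marginal_mean(method):
    """Mean score for a method, averaged over both class sizes."""
    return (cell_means[(method, "small")] + cell_means[(method, "large")]) / 2


# Interaction contrast: does the class-size effect differ between methods?
interaction = size_effect("discussion") - size_effect("lecture")

print(size_effect("discussion"), size_effect("lecture"), interaction)  # -15.0 0.0 -15.0
print(marginal_mean("discussion"), marginal_mean("lecture"))           # 77.5 78.0
```

The marginal means (77.5 vs 78.0) suggest the two methods are nearly equivalent, which hides the fact that discussion is 7 points better in small classes and 8 points worse in large ones.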

Repeated-Measures ANOVA

When the same subjects are measured multiple times (before treatment, during treatment, after treatment), the observations are not independent, because measurements taken on the same person tend to be correlated. Repeated-measures ANOVA accounts for this dependence by partitioning out the variance due to individual differences, so that each person effectively serves as their own control. This design is typically more powerful than a between-subjects design because it removes person-to-person variability from the error term, making it easier to detect real treatment effects.
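The gain in power can be seen in a small sketch that partitions the total sum of squares into condition, subject, and error components. The data are made up (three subjects measured under three conditions, with large differences between subjects), and the function name is hypothetical. For comparison, the sketch also computes the F-ratio that would result if the same numbers were wrongly treated as independent groups.

```python
from statistics import mean


def repeated_measures_f(data):
    """data[i][j] is the score of subject i under condition j.

    Returns (F with subject variance removed, F ignoring the pairing).
    """
    n, k = len(data), len(data[0])  # subjects, conditions
    grand = mean([x for row in data for x in row])
    cond_means = [mean([row[j] for row in data]) for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((mean(row) - grand) ** 2 for row in data)
    ss_error = ss_total - ss_cond - ss_subj  # what remains after removing subjects

    ms_cond = ss_cond / (k - 1)
    f_repeated = ms_cond / (ss_error / ((n - 1) * (k - 1)))
    # Between-subjects analysis: subject differences stay in the error term
    f_between = ms_cond / ((ss_total - ss_cond) / (k * (n - 1)))
    return f_repeated, f_between


# Made-up scores: rows are subjects, columns are before / during / after
data = [[10, 12, 14], [20, 23, 23], [30, 31, 35]]
f_repeated, f_between = repeated_measures_f(data)
print(f_repeated, round(f_between, 3))  # 12.0 0.119
```

With these numbers, almost all of the total variation (600 of 628) comes from differences between subjects. Removing it leaves an error sum of squares of 4, so the same condition effect yields F = 12 instead of roughly 0.12.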

The additional assumption of sphericity requires that the variances of differences between all pairs of conditions are equal. Mauchly's test checks this assumption. When violated, corrections such as Greenhouse-Geisser or Huynh-Feldt adjust the degrees of freedom downward, making the test more conservative.

ANCOVA and Extensions

Analysis of Covariance (ANCOVA) combines ANOVA with regression by including one or more continuous covariates alongside the categorical grouping factor. For example, when comparing test scores across three teaching methods, you might include prior GPA as a covariate to control for pre-existing differences in student ability. ANCOVA adjusts the group means for the covariate, producing a fairer comparison. The assumptions are the same as ANOVA plus linearity of the covariate-outcome relationship and homogeneity of regression slopes (the covariate has the same relationship with the outcome in each group).
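One way to see what "adjusting the group means" does is to compute the adjustment by hand: estimate a single pooled within-group slope for the covariate, then shift each group mean to the value expected at the overall covariate mean. The sketch below does that with invented data on an arbitrary scale. The function name is hypothetical, and the sketch takes the homogeneity-of-slopes assumption for granted rather than testing it.

```python
from statistics import mean


def adjusted_means(groups):
    """groups maps a label to a list of (covariate, outcome) pairs.

    Returns covariate-adjusted group means using the pooled within-group slope.
    """
    grand_x = mean([x for pairs in groups.values() for x, _ in pairs])
    sxy = sxx = 0.0
    for pairs in groups.values():
        mx = mean([x for x, _ in pairs])
        my = mean([y for _, y in pairs])
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    slope = sxy / sxx  # pooled within-group regression slope
    return {
        label: mean([y for _, y in pairs])
        - slope * (mean([x for x, _ in pairs]) - grand_x)
        for label, pairs in groups.items()
    }


# Made-up (prior ability, test score) pairs for three teaching methods
groups = {
    "A": [(1, 2), (2, 4), (3, 6)],   # raw mean 4, low prior ability
    "B": [(3, 7), (4, 9), (5, 11)],  # raw mean 9, high prior ability
    "C": [(2, 5), (3, 7), (4, 9)],   # raw mean 7, average prior ability
}
print(adjusted_means(groups))  # {'A': 6.0, 'B': 7.0, 'C': 7.0}
```

The raw means (4, 9, 7) make method B look far better than A, but most of that gap reflects prior ability. After adjustment the means are 6, 7, and 7.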

MANOVA (Multivariate ANOVA) extends the framework to simultaneously test group differences on multiple dependent variables. Rather than running separate ANOVAs for exam score, homework completion, and class participation, MANOVA tests whether the groups differ across all three outcomes considered together. This approach controls the overall error rate and can detect patterns of group differences that individual ANOVAs would miss.

Practical Workflow for ANOVA

A typical ANOVA analysis follows this sequence: state the hypotheses, check that assumptions are met (normality via the Shapiro-Wilk test, equal variances via Levene's test), run the omnibus F-test, examine effect sizes, perform post-hoc comparisons if the F-test is significant, and report all results together. If normality is seriously violated, switch to the Kruskal-Wallis test. If the equal variance assumption is violated, use Welch's ANOVA with Games-Howell post-hoc tests. Plotting group means with error bars helps communicate results to audiences unfamiliar with F-statistics.
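The branching in that workflow can be written down as a tiny decision helper. This is only a sketch of the rules stated above: the function name and its boolean inputs are hypothetical, the assumption checks themselves are taken as already done, and checking normality before equal variances is an arbitrary ordering chosen for the example.

```python
def choose_omnibus_test(normality_ok: bool, equal_variances_ok: bool) -> str:
    """Pick the omnibus test suggested by the workflow above."""
    if not normality_ok:
        return "Kruskal-Wallis test"
    if not equal_variances_ok:
        return "Welch's ANOVA (Games-Howell post-hoc)"
    return "one-way ANOVA"


print(choose_omnibus_test(True, True))    # one-way ANOVA
print(choose_omnibus_test(True, False))   # Welch's ANOVA (Games-Howell post-hoc)
print(choose_omnibus_test(False, True))   # Kruskal-Wallis test
```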

Effect Size in ANOVA

The effect size for ANOVA is commonly reported as eta-squared or partial eta-squared. Eta-squared is the proportion of total variance explained by the factor. Partial eta-squared is the proportion explained once variance attributed to other factors in the model is set aside, and the two are identical in a one-way design. Eta-squared values of 0.01, 0.06, and 0.14 correspond to small, medium, and large effects by conventional benchmarks. Cohen's f is another effect size measure for ANOVA, with values of 0.10, 0.25, and 0.40 considered small, medium, and large. Always report effect sizes alongside F-statistics and p-values to communicate the practical importance of group differences. In published research, a complete ANOVA report looks like: F(2, 87) = 6.34, p = 0.003, partial eta-squared = 0.13, indicating a medium-to-large effect with strong statistical evidence. Without the effect size, readers cannot judge whether the statistically significant difference is large enough to be practically meaningful. A study with 1000 participants per group might find p = 0.001 for a trivially small difference that no one would act upon.
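The reported effect size can be recovered from the F-statistic and its degrees of freedom, which is a handy consistency check when reading a paper. The sketch below uses the standard identities partial eta-squared = (F x df1) / (F x df1 + df2) and f = sqrt(eta-squared / (1 - eta-squared)); the function names are illustrative.

```python
from math import sqrt


def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta-squared recovered from an F-statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cohens_f(eta_squared: float) -> float:
    """Convert an eta-squared value to Cohen's f."""
    return sqrt(eta_squared / (1 - eta_squared))


# The example report from the text: F(2, 87) = 6.34
eta = partial_eta_squared(6.34, 2, 87)
print(round(eta, 2))  # 0.13

# The two sets of benchmarks line up with each other
for benchmark in (0.01, 0.06, 0.14):
    print(benchmark, round(cohens_f(benchmark), 2))
# 0.01 0.1
# 0.06 0.25
# 0.14 0.4
```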

Key Takeaway

ANOVA tests whether three or more group means differ by comparing between-group variance to within-group variance using the F-statistic. A significant result requires post-hoc tests to identify which specific groups differ. Check assumptions of normality and equal variances, and always report effect sizes alongside significance tests.