Statistics in Python
Statistical analysis in Python follows a consistent pattern: load your data into a pandas DataFrame, compute descriptive statistics to understand what you have, check assumptions (normality, equal variances), choose the appropriate test, run it, compute effect sizes alongside p-values, and visualize the results. Each step uses specific functions from the three core libraries, and the output at each step informs the next. The key advantage over point-and-click statistical software is that every step is recorded in code, making the entire analysis reproducible and auditable.
Step 1: Compute Descriptive Statistics
pandas provides descriptive statistics for DataFrames. df.describe() computes count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for all numeric columns. df['column'].mean(), .median(), .std(), .var(), .min(), .max(), .sum() compute individual statistics. df['column'].quantile([0.25, 0.75]) returns specific percentiles. df['column'].mode() returns the most frequent value. df.groupby('condition').describe() computes all summary statistics broken down by group, the first table you should generate in any experimental analysis.
NumPy handles statistics on raw arrays without pandas overhead. np.mean(data), np.median(data), np.std(data, ddof=1) compute sample statistics (ddof=1 for sample standard deviation using N-1 denominator). np.percentile(data, [2.5, 97.5]) returns the 2.5th and 97.5th percentiles, useful for non-parametric confidence intervals. np.corrcoef(x, y) returns the correlation matrix. For weighted statistics, np.average(data, weights=w) computes the weighted mean.
scipy.stats adds more specialized descriptive statistics. stats.describe(data) returns count, minmax, mean, variance, skewness, and kurtosis in a single call. stats.sem(data) computes the standard error of the mean. stats.iqr(data) computes the interquartile range. stats.trim_mean(data, 0.1) computes the mean after trimming the top and bottom 10% of values, a robust alternative to the mean that is less sensitive to outliers. stats.zscore(data) standardizes data to zero mean and unit variance.
Distribution visualization complements numerical summaries. Plot histograms (plt.hist or sns.histplot) to see the shape of distributions. Plot box plots (sns.boxplot) to compare distributions across groups. Plot Q-Q plots (stats.probplot(data, plot=plt)) to assess normality visually: if the points fall on a straight line, the data is approximately normally distributed. These visual checks are essential because many statistical tests assume normally distributed data, and violations of this assumption affect the validity of results.
Step 2: Run Hypothesis Tests
Two-sample t-tests compare means between independent groups. stats.ttest_ind(group1, group2) returns (t_statistic, p_value). If p_value < 0.05 (the conventional threshold), you reject the null hypothesis that the means are equal. For groups with unequal variances, use stats.ttest_ind(group1, group2, equal_var=False), which applies Welch's t-test instead of Student's t-test. For paired data (same subjects measured under two conditions), use stats.ttest_rel(condition1, condition2). For testing whether a sample mean differs from a specific value, use stats.ttest_1samp(data, expected_value).
Non-parametric tests handle data that violates normality assumptions. stats.mannwhitneyu(group1, group2) is the non-parametric equivalent of the independent t-test, comparing the distributions of two groups without assuming normality. stats.wilcoxon(condition1 - condition2) is the non-parametric paired test. stats.kruskal(group1, group2, group3) is the non-parametric equivalent of one-way ANOVA. Use non-parametric tests when sample sizes are small (under 30), distributions are clearly skewed, or data is ordinal rather than continuous.
ANOVA compares means across three or more groups. stats.f_oneway(group1, group2, group3) performs one-way ANOVA, returning (F_statistic, p_value). A significant result means at least one group differs but does not identify which pairs differ. Follow up with pairwise t-tests or Tukey's HSD test (from statsmodels: pairwise_tukeyhsd(data, groups)) to identify specific group differences. For factorial designs (two or more independent variables), statsmodels provides anova_lm for Type I, II, or III sums of squares.
Chi-squared tests analyze categorical data. stats.chi2_contingency(contingency_table) tests whether two categorical variables are independent, given a 2D contingency table (created by pd.crosstab(df['var1'], df['var2'])). The function returns (chi2_statistic, p_value, degrees_of_freedom, expected_frequencies). stats.chisquare(observed_frequencies, expected_frequencies) tests whether observed frequencies match expected frequencies (goodness-of-fit test). stats.fisher_exact(table_2x2) is preferred over chi-squared for 2x2 tables with small cell counts (any expected count below 5).
Step 3: Build Regression Models
statsmodels provides regression with detailed output tables that match what researchers expect from statistical software. import statsmodels.api as sm, then model = sm.OLS(y, sm.add_constant(X)).fit() fits an ordinary least squares regression. model.summary() prints a comprehensive table with coefficients, standard errors, t-statistics, p-values, R-squared, adjusted R-squared, F-statistic, AIC, BIC, and residual diagnostics. The formula interface is more readable: import statsmodels.formula.api as smf, then model = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit() specifies the model using R-style formulas with interaction terms.
Regression diagnostics verify that model assumptions hold. model.resid gives residuals. Plot residuals vs. fitted values (plt.scatter(model.fittedvalues, model.resid)) to check for heteroscedasticity (non-constant variance). stats.shapiro(model.resid) tests residual normality. model.get_influence().summary_frame() identifies influential observations with Cook's distance. Variance inflation factors (VIF) detect multicollinearity: from statsmodels.stats.outliers_influence import variance_inflation_factor. If diagnostics reveal violations, consider transforming variables (log, square root), using robust standard errors (model.get_robustcov_results()), or switching to generalized linear models.
Logistic regression models binary outcomes. smf.logit('outcome ~ predictor1 + predictor2', data=df).fit() fits a logistic regression. model.summary() reports log-odds coefficients, standard errors, z-statistics, and p-values. np.exp(model.params) converts log-odds coefficients to odds ratios for interpretation: an odds ratio of 2.5 means the odds of the outcome are 2.5 times higher for each unit increase in the predictor. model.pred_table() shows the confusion matrix. For count outcomes, use smf.poisson() or smf.negativebinomial().
Mixed-effects models handle hierarchical and repeated-measures data. smf.mixedlm('response ~ treatment', data=df, groups=df['subject']).fit() fits a linear mixed model with random intercepts for each subject. This properly accounts for within-subject correlation that regular regression ignores. Add random slopes with re_formula='~treatment' to allow the treatment effect to vary across subjects. Mixed models are the correct approach for experimental designs where multiple measurements come from the same subjects, sites, or batches.
Step 4: Calculate Effect Sizes and Power
P-values alone are insufficient for interpreting results. A statistically significant p-value (p < 0.05) with a tiny effect size means the result is real but unimportant. A non-significant p-value (p > 0.05) with a large effect size means you may simply need more data. Always report effect sizes alongside p-values. Cohen's d for mean differences: d = (mean1 - mean2) / pooled_std, where pooled_std = np.sqrt((std1**2 + std2**2) / 2). Conventions: d = 0.2 is small, d = 0.5 is medium, d = 0.8 is large, but these are rough guidelines that vary by field.
Correlation coefficients serve as effect sizes for relationships. stats.pearsonr(x, y) returns both the correlation and p-value. R-squared (the square of Pearson's r) tells you the proportion of variance in y explained by x: r = 0.5 means R-squared = 0.25, so x explains 25% of the variance in y. For regression models, model.rsquared and model.rsquared_adj provide the overall effect size. Eta-squared for ANOVA: SS_between / SS_total, computable from the ANOVA table as F * df_between / (F * df_between + df_within).
Confidence intervals quantify the uncertainty in your estimates. For means: stats.t.interval(0.95, df=n-1, loc=mean, scale=sem) returns the 95% confidence interval. For regression coefficients: model.conf_int(alpha=0.05) returns 95% confidence intervals for all coefficients. For non-parametric confidence intervals on the median: use bootstrap resampling. Confidence intervals are more informative than p-values because they show both the direction and magnitude of the effect along with its precision.
Power analysis determines the sample size needed to detect an effect of a given size. The statsmodels.stats.power module provides calculators: TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8) returns the required sample size per group to detect a medium effect (d = 0.5) with 80% power at 5% significance. For a study that has already been completed, compute post-hoc power by plugging in the observed effect size and sample size. Underpowered studies (power below 0.8) have a high risk of missing real effects, and their significant results tend to overestimate effect sizes.
Step 5: Handle Multiple Comparisons
Running multiple statistical tests inflates the false positive rate. If you run 20 independent tests at alpha = 0.05, you expect one false positive even when there are no real effects. Multiple comparison corrections adjust p-values to control the overall error rate. statsmodels.stats.multitest.multipletests(p_values, method='bonferroni') applies the Bonferroni correction (multiply each p-value by the number of tests), the most conservative approach. method='holm' applies the Holm step-down correction, which is uniformly more powerful than Bonferroni and should be preferred in most cases.
False Discovery Rate (FDR) control is less conservative than family-wise error rate methods like Bonferroni, making it appropriate for exploratory analyses with many tests (genomics, neuroimaging, feature selection). multipletests(p_values, method='fdr_bh') applies the Benjamini-Hochberg procedure, which controls the expected proportion of false positives among the rejected hypotheses rather than the probability of any false positive. FDR at 0.05 means that among all tests called significant, you expect at most 5% to be false positives.
Post-hoc tests for ANOVA handle pairwise comparisons after a significant omnibus test. pairwise_tukeyhsd(data, groups, alpha=0.05) performs Tukey's HSD test, which controls the family-wise error rate for all pairwise comparisons. The result includes mean differences, confidence intervals, and adjusted p-values for every pair. Dunnett's test (available in scipy.stats as dunnett) compares each treatment group against a single control group, which is more powerful than Tukey when a control group comparison is the primary interest.
Pre-registration and hypothesis specification reduce the multiple comparisons problem at the design stage. Decide which tests you will run before collecting data, register your analysis plan, and distinguish between confirmatory (pre-specified) and exploratory (data-driven) analyses in your paper. Confirmatory analyses use conventional significance thresholds. Exploratory analyses use FDR correction and are reported as hypothesis-generating rather than hypothesis-confirming. This framework is scientifically honest and increasingly required by journals and funding agencies.
Always report effect sizes and confidence intervals alongside p-values. A result that is statistically significant but has a tiny effect size is rarely worth acting on, and a non-significant result with a wide confidence interval simply means you need more data.