Reproducible Research with Python
The reproducibility crisis affects computational science just as deeply as experimental science. A 2016 survey in Nature found that more than 70% of researchers had tried and failed to reproduce another scientist's experiments, and more than half had failed to reproduce their own. In computational work the causes are predictable: code that depends on unlisted library versions, analysis steps that were performed manually and never recorded, random seeds that were not set, data that was modified without documentation, and computational environments that cannot be recreated because the dependencies were never specified. Every one of these problems has a straightforward technical solution in Python.
Step 1: Lock Your Environment
Create a virtual environment for every project: python -m venv .venv or conda create -n project_name python=3.12. Activate it before installing anything (source .venv/bin/activate on Linux and macOS, or conda activate project_name). Install packages with pip install or conda install. Record the environment: pip freeze > requirements.txt (pip) or conda env export > environment.yml (conda). These files capture every installed package and its exact version. Another researcher recreates the environment with pip install -r requirements.txt or conda env create -f environment.yml and gets the same library versions you used. A full conda export also records platform-specific build strings, so it may not resolve on a different operating system; conda env export --no-builds omits them.
Pin versions explicitly rather than relying on ranges. requirements.txt should contain numpy==1.26.4, not numpy>=1.26 or just numpy. Version ranges allow silent upgrades that can change numerical results (different floating-point behavior), break API compatibility, or introduce new bugs. For conda, environment.yml should list explicit versions: numpy=1.26.4. The difference between "I used NumPy" and "I used NumPy 1.26.4" is the difference between an unreproducible claim and a verifiable specification.
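The pinning rule can be checked mechanically. The following is a minimal sketch, not a full requirements parser: the function name unpinned_requirements and the example file contents are illustrative assumptions, and lines with environment markers or URLs would be flagged even if they are intentional.

```python
import re

# Matches "name==version", optionally with extras such as "pkg[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like "-r other.txt"
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


# Illustrative file contents (an assumption, not taken from a real project).
example = """\
numpy==1.26.4
pandas>=2.0
scipy
# a comment
matplotlib==3.8.4
"""
print(unpinned_requirements(example))  # ['pandas>=2.0', 'scipy']
```

A check like this could run before each commit so that a version range never slips into the specification unnoticed.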
Record the Python version itself: python --version >> environment_info.txt. Python versions can differ in standard library behavior, in default settings such as hash randomization, and occasionally in numerical details. Also record the operating system and hardware architecture: results can differ between Linux and macOS, between x86 and ARM, and between CPU and GPU execution. Include a README that documents these environmental requirements so that differences in numerical results can be diagnosed rather than remaining mysterious.
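The same information can be captured from inside Python with the standard library. This is a sketch under the assumption that a JSON snapshot is acceptable in place of a text file; the function name environment_info and the output layout are illustrative choices.

```python
import json
import platform
from importlib import metadata


def environment_info():
    """Collect interpreter, platform, and installed-package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),  # e.g. x86_64 or arm64
        "packages": dict(sorted(packages.items())),
    }


info = environment_info()
print(json.dumps({k: v for k, v in info.items() if k != "packages"}, indent=2))
print(f"{len(info['packages'])} packages recorded")
```

Writing json.dumps(info, indent=2) to a file next to the results keeps the snapshot together with the outputs it describes.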
Step 2: Version Control Everything
Initialize a git repository at the start of every project: git init. Commit early and often: git add analysis.py, git commit -m 'Add initial data loading and cleaning'. Every commit captures a snapshot of the code at a specific point in time, with a message explaining what changed. If a figure in your paper was generated three months ago and you have since modified the analysis, git log and git checkout allow you to recover the exact code version that produced that figure.
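Recovering the code behind a figure is easier if each output records the commit that produced it. The sketch below assumes git is installed and the script runs inside the repository; the function name current_commit and the "unknown" fallback are illustrative choices.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the current git commit hash, or 'unknown' if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a repository
        return "unknown"
    return result.stdout.strip()


# Hypothetical provenance record to store alongside a figure or table.
provenance = {"figure": "figure1.pdf", "commit": current_commit()}
print(provenance)
```

With the hash stored next to the figure, git checkout followed by that hash restores the code that generated it.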
Structure commits around logical changes: one commit per analysis step, bug fix, or feature addition. Avoid massive commits that change everything at once ("updated code"). Write descriptive commit messages: "Fix off-by-one error in sample indexing that affected Figure 3" tells a reviewer exactly what changed and why. Tag important milestones: git tag -a v1.0-submitted -m 'Code as submitted to journal' marks the exact state of the code at submission time.
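Track code and configuration, but not large data files or generated outputs. Add data/, results/, *.csv, *.h5, __pycache__/ to .gitignore. For data provenance, store the URL or database query that retrieves the data, the download date, and a checksum that verifies the data has not changed. The one-liner hashlib.md5(open('data.csv', 'rb').read()).hexdigest() works for small files, but it reads the whole file into memory and never closes it; for large files, hash in chunks inside a with block. MD5 is adequate for detecting accidental changes, while hashlib.sha256 is the safer choice if tampering is a concern. For large datasets that must be versioned, use Git LFS (Large File Storage) or DVC (Data Version Control), which track data files separately from code while maintaining the connection between specific code versions and the data they used.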
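A chunked checksum helper might look like the following sketch. The function name file_checksum, the default of SHA-256, and the 1 MiB chunk size are assumptions for illustration; passing algorithm="md5" gives the same kind of digest as the MD5 one-liner.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file standing in for data.csv.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "data.csv"
    demo_path.write_bytes(b"id,value\n1,3.14\n")
    observed = file_checksum(demo_path)
    observed_md5 = file_checksum(demo_path, algorithm="md5")

expected = hashlib.sha256(b"id,value\n1,3.14\n").hexdigest()
print(observed == expected)  # True
```

Storing the digest in the README or a small metadata file lets a later run fail fast if the downloaded data no longer matches.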
Host the repository on GitHub or GitLab for backup, collaboration, and sharing. Public repositories allow reviewers and readers to inspect your code. GitHub's Zenodo integration automatically creates a DOI (Digital Object Identifier) for each repository release, providing a permanent, citable reference to the exact code version associated with a publication. Include the DOI in your paper's methods section or data availability statement.
Step 3: Control Randomness
Set random seeds at the beginning of every analysis script that uses random numbers. np.random.seed(42) fixes NumPy's legacy global random state. rng = np.random.default_rng(42) instead creates an independent generator, which is the approach NumPy recommends for new code; it makes results reproducible only for draws made through that object (for example rng.normal(size=10)) and does not affect the np.random.* functions. random.seed(42) fixes the Python standard library random module. For scikit-learn, pass random_state=42 to every estimator and function that accepts it: train_test_split(X, y, random_state=42), RandomForestClassifier(random_state=42), KMeans(random_state=42). For PyTorch: torch.manual_seed(42), torch.cuda.manual_seed_all(42). For TensorFlow: tf.random.set_seed(42).
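One way to keep seeding in a single place is a small helper called at the top of each script. This is a sketch: the name seed_everything is an assumption, and it seeds only the standard library and, when installed, NumPy's legacy global state; other frameworks would need their own calls added.

```python
import random


def seed_everything(seed):
    """Seed the available global random number generators; return the names seeded."""
    seeded = ["random"]
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        pass  # NumPy is optional for this sketch
    else:
        np.random.seed(seed)  # legacy global NumPy state only
        seeded.append("numpy")
    return seeded


SEED = 42
seed_everything(SEED)
first_draw = random.random()
seed_everything(SEED)
second_draw = random.random()
print(first_draw == second_draw)  # True
```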
The specific seed value does not matter (42 is a convention from The Hitchhiker's Guide to the Galaxy), but it must be set and documented. Results should not depend on the particular seed chosen: if changing the seed from 42 to 123 dramatically changes your conclusions, the analysis is not robust and needs more data or a different approach. Report the seed in your methods section so that exact numerical reproduction is possible.
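Seed robustness can be checked by rerunning the analysis over several seeds and looking at the spread. In this sketch, estimate_mean is a hypothetical stand-in for a real analysis, and the seed list is arbitrary.

```python
import random
import statistics


def estimate_mean(seed, n=1000):
    """Hypothetical analysis: estimate the mean of a noisy measurement."""
    rng = random.Random(seed)  # local generator, independent of global state
    return statistics.fmean(rng.gauss(5.0, 1.0) for _ in range(n))


seeds = [42, 123, 2024, 7, 99]
estimates = [estimate_mean(seed) for seed in seeds]
spread = max(estimates) - min(estimates)

for seed, value in zip(seeds, estimates):
    print(f"seed={seed:>5}  estimate={value:.4f}")
print(f"spread across seeds: {spread:.4f}")
```

A spread that is small relative to the effect being reported suggests the conclusion does not hinge on the seed; a large spread is a signal to collect more data or rethink the method.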
Be aware that some operations are non-deterministic even with seeds set. GPU operations in deep learning can produce slightly different results across runs due to non-deterministic atomic operations. Multi-threaded operations can produce different floating-point summation orders across runs. Hash randomization in Python 3 changes the hashes of str and bytes objects from one interpreter run to the next, which affects the iteration order of sets (dictionaries preserve insertion order since Python 3.7, so they are affected only when they are built by iterating over a set). For strict reproducibility, set PYTHONHASHSEED=0 as an environment variable before the interpreter starts (setting it inside a running script has no effect on that process) and configure deep learning frameworks for deterministic mode (torch.use_deterministic_algorithms(True)), accepting the performance penalty.
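The effect of PYTHONHASHSEED can be observed by launching fresh interpreters with the variable set. This sketch assumes sys.executable points at a working Python interpreter; the helper name string_hash_in_subprocess is illustrative.

```python
import os
import subprocess
import sys


def string_hash_in_subprocess(text, hash_seed):
    """Return hash(text) as computed by a new interpreter started with PYTHONHASHSEED set."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With a fixed hash seed, two separate interpreter runs agree.
first = string_hash_in_subprocess("sample_id", 0)
second = string_hash_in_subprocess("sample_id", 0)
print(first == second)  # True
```

In practice the variable is usually exported in the shell or set in the pipeline's launch command rather than through a subprocess wrapper.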
Step 4: Automate the Full Pipeline
The gold standard for reproducibility is a single command that transforms raw data into all final outputs (figures, tables, statistics). Structure the pipeline as a Python script or Makefile: python run_analysis.py or make all. The script should: load raw data (never modify raw data files), apply all cleaning and transformation steps, run all analyses, generate all figures, and save all results. If the script runs without errors and produces the same outputs, the analysis is reproducible.
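A single-entry-point script along these lines might be structured as below. This is a minimal sketch: the column name value, the file summary.json, and the function names are assumptions, and a real run_analysis.py would also generate figures.

```python
import csv
import json
import statistics
import tempfile
from pathlib import Path


def load_raw(path):
    """Read the raw CSV without modifying it."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def clean(rows):
    """Drop rows with a missing value and convert the rest to float."""
    return [float(row["value"]) for row in rows if row["value"].strip()]


def analyze(values):
    return {"n": len(values), "mean": statistics.fmean(values)}


def run_pipeline(raw_path, results_dir):
    """Raw data in, saved results out: the whole analysis in one call."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    summary = analyze(clean(load_raw(raw_path)))
    (results_dir / "summary.json").write_text(json.dumps(summary, indent=2, sort_keys=True))
    return summary


# Demonstration on a tiny synthetic file (not real data).
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    raw.write_text("sample,value\na,1.0\nb,\nc,3.0\n")
    summary = run_pipeline(raw, Path(tmp) / "results")
print(summary)  # {'n': 2, 'mean': 2.0}
```

Keeping each stage a pure function of its inputs makes it easier to rerun or test one stage without touching the raw files.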
Makefile-based pipelines track dependencies between steps. A Makefile rule has a target line, results/figure1.pdf: data/cleaned.csv scripts/plot_figure1.py, followed by a tab-indented recipe line, python scripts/plot_figure1.py. This rule specifies that figure1.pdf depends on cleaned.csv and plot_figure1.py. Running make only re-executes steps whose dependencies have changed, saving time during iterative development. Snakemake provides a Python-native alternative with additional features for scientific workflows: conda environment specification per step, cluster execution, and automatic parallelization of independent steps.
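The rebuild decision that make applies can be illustrated in a few lines of Python. This sketch is a simplified model based on file modification times, not a replacement for make or Snakemake; the function name needs_rebuild and the timestamps are illustrative.

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "cleaned.csv"
    figure = Path(tmp) / "figure1.pdf"
    data.write_text("x,y\n1,2\n")
    missing = needs_rebuild(figure, [data])  # target does not exist yet

    figure.write_text("placeholder")
    os.utime(data, (1_000, 1_000))    # dependency older than target
    os.utime(figure, (2_000, 2_000))
    up_to_date = not needs_rebuild(figure, [data])

    os.utime(data, (3_000, 3_000))    # dependency modified after target
    stale = needs_rebuild(figure, [data])

print(missing, up_to_date, stale)  # True True True
```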
Test the pipeline from a clean state periodically. Clone the repository into a fresh directory, create the environment from the specification file, download the data, and run the pipeline. If it fails or produces different results, the pipeline has hidden dependencies (files, paths, environment variables, installed software) that are not captured in the repository. Fix these issues as they arise rather than discovering them during peer review or when a colleague tries to reproduce your work.
Continuous integration (CI) services like GitHub Actions can automatically run your pipeline on every commit. Create a workflow file that installs the environment, downloads test data, runs the analysis, and compares outputs to expected results. This catches reproducibility-breaking changes immediately rather than months later. For long-running analyses, run CI on a subset of the data or on simplified parameters that complete quickly while exercising the same code paths.
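The comparison of outputs to expected results can be a short function that a CI job calls after the pipeline finishes. This sketch assumes results are stored as flat dictionaries of scalars; the function name compare_results and the tolerances are illustrative and should be chosen per analysis.

```python
import math


def compare_results(actual, expected, rel_tol=1e-9, abs_tol=0.0):
    """Return human-readable mismatches between two flat dicts of results."""
    mismatches = []
    for key in sorted(set(actual) | set(expected)):
        if key not in actual or key not in expected:
            mismatches.append(f"{key}: present in only one result set")
            continue
        a, e = actual[key], expected[key]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in (a, e)
        )
        # Numbers are compared with a tolerance; everything else must match exactly.
        same = math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol) if numeric else a == e
        if not same:
            mismatches.append(f"{key}: got {a!r}, expected {e!r}")
    return mismatches


expected = {"n": 100, "mean": 0.3, "label": "control"}
matching = compare_results({"n": 100, "mean": 0.1 + 0.2, "label": "control"}, expected)
differing = compare_results({"n": 100, "mean": 0.31, "label": "control"}, expected)
print(matching)   # []
print(differing)  # ['mean: got 0.31, expected 0.3']
```

Comparing with a tolerance rather than exact equality avoids false alarms from harmless floating-point differences between machines, while still catching real changes.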
Step 5: Package and Share
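Docker containers package the entire computational environment (operating system libraries, Python, packages, code) into a single reproducible image. Create a Dockerfile with these instructions, one per line: FROM python:3.12, WORKDIR /app, COPY requirements.txt ., RUN pip install -r requirements.txt, COPY . /app. Build the image with docker build -t my-analysis . (the trailing dot is the build context, the current directory). Run it with docker run my-analysis python run_analysis.py. The container provides the same software stack on any machine with Docker installed, regardless of the host operating system, which eliminates most "works on my machine" problems caused by OS-level differences. Numerical results can still vary across CPU architectures, and a tag such as python:3.12 is updated over time, so pin the base image to a specific version tag or digest for long-term reproducibility.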
Binder (mybinder.org) creates executable environments from GitHub repositories without any installation. Add a requirements.txt or environment.yml to your repository, then construct a Binder URL. Anyone clicking the link gets a live JupyterLab environment with your code, data, and dependencies, running in the cloud, ready to execute. Include a Binder badge in your README and paper. Binder is the lowest-friction path to letting reviewers and readers actually run your code rather than just reading it.
A reproducibility package for journal submission should include: the raw data (or instructions and code to obtain it), all analysis code (scripts and/or notebooks), the environment specification (requirements.txt or environment.yml), a README explaining how to set up the environment and run the analysis, and the expected outputs (figures, tables) for verification. Archive the complete package on Zenodo or Figshare to get a DOI. Reference the DOI in your paper. This package, combined with your published paper, constitutes a complete, independently verifiable scientific record.
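Before archiving, a quick completeness check can confirm that the package contains the listed components. The sketch below is an assumption-laden illustration: the file names in REQUIRED_ANY are hypothetical conventions, not journal requirements, and should be adapted to the project.

```python
import tempfile
from pathlib import Path

# Each component is satisfied if any one of its candidate files exists.
# These file names are illustrative assumptions.
REQUIRED_ANY = {
    "environment specification": ["requirements.txt", "environment.yml"],
    "README": ["README.md", "README.txt", "README"],
    "analysis entry point": ["run_analysis.py", "Makefile"],
}


def missing_components(package_dir):
    """Return the labels of required components not found in package_dir."""
    root = Path(package_dir)
    return [
        label
        for label, candidates in REQUIRED_ANY.items()
        if not any((root / name).exists() for name in candidates)
    ]


# Demonstration on a temporary directory with an incomplete package.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "requirements.txt").write_text("numpy==1.26.4\n")
    (Path(tmp) / "README.md").write_text("How to run the analysis\n")
    missing = missing_components(tmp)
print(missing)  # ['analysis entry point']
```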
Code review for scientific code is as important as peer review for scientific papers. Have a colleague run your pipeline independently before submission. Ask them to check whether the code is readable, whether the analysis steps make sense, whether assumptions are documented, and whether they can reproduce the results. Computational reproducibility is necessary but not sufficient: the code should also be correct, meaning it implements the analysis described in the paper. Code review catches errors that automated testing misses because a human can evaluate whether the logic matches the scientific intent.
Reproducibility requires three things: a locked environment (exact library versions), version-controlled code (complete history), and an automated pipeline (single command from raw data to final outputs). Do these three things and your computational research is verifiable.