Bioinformatics with Python

Updated May 2026
Python is the primary scripting language for bioinformatics because Biopython provides tools for parsing biological file formats, searching sequence databases, performing alignments, and analyzing protein structures, while the broader scientific Python ecosystem (pandas, NumPy, matplotlib, scikit-learn) handles the statistical analysis and visualization that bioinformatics constantly requires. From processing raw sequencing reads to building phylogenetic trees and analyzing gene expression, Python connects every step of the bioinformatics workflow.

Bioinformatics deals with biological data at computational scale: genomes with billions of base pairs, proteomes with tens of thousands of proteins, expression datasets with measurements across thousands of genes and hundreds of conditions. The data is stored in specialized file formats (FASTA, FASTQ, GenBank, PDB, VCF, BAM, GFF) that general-purpose tools cannot parse. Biopython provides parsers for all of these formats, converting them into Python objects that can be manipulated, searched, and analyzed. Combined with pandas for tabular data and matplotlib for visualization, Python handles the entire pipeline from raw biological data to publishable figures and statistical results.

Step 1: Work with Biological Sequences

Biopython represents sequences as Seq objects with biologically aware operations. from Bio.Seq import Seq, then dna = Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG'). dna.complement() returns the complementary strand. dna.reverse_complement() returns the reverse complement used for the antisense strand. dna.transcribe() converts DNA to mRNA (T to U). dna.translate() converts to a protein sequence using the standard genetic code. These operations handle the biology correctly: translate() reads codons (triplets), respects start codons, and stops at stop codons.

Parsing sequence files uses SeqIO, Biopython's unified file I/O module. from Bio import SeqIO. For FASTA files: records = list(SeqIO.parse('sequences.fasta', 'fasta')). Each record has .id (identifier), .description (full header line), and .seq (the sequence as a Seq object). For GenBank files: records = list(SeqIO.parse('genome.gb', 'genbank')). GenBank records include .features (genes, CDS, regulatory elements with their locations and qualifiers), .annotations (organism, taxonomy, references), and .seq. SeqIO handles format conversion: SeqIO.convert('input.gb', 'genbank', 'output.fasta', 'fasta') converts between formats in one line.

Sequence properties provide basic characterization. len(seq) gives the sequence length. seq.count('ATG') counts occurrences of a motif. GC content: from Bio.SeqUtils import gc_fraction, then gc = gc_fraction(seq). Molecular weight: from Bio.SeqUtils import molecular_weight, then mw = molecular_weight(seq, seq_type='DNA'). For protein sequences, ProteinAnalysis computes amino acid composition, molecular weight, isoelectric point, secondary structure predictions, and hydrophobicity profiles: from Bio.SeqUtils.ProtParam import ProteinAnalysis, analysis = ProteinAnalysis(str(protein_seq)).

Working with large sequence collections requires efficient iteration rather than loading everything into memory. SeqIO.parse() returns a generator that reads one record at a time. Filter sequences by criteria: long_seqs = (r for r in SeqIO.parse('input.fasta', 'fasta') if len(r.seq) > 500). Write filtered results: SeqIO.write(long_seqs, 'filtered.fasta', 'fasta'). For very large files (whole genomes, metagenomics datasets), use pysam for indexed BAM/CRAM access and pyfaidx for indexed FASTA access, which retrieve specific regions without reading the entire file.

Step 2: Search Biological Databases

NCBI Entrez provides programmatic access to GenBank, PubMed, and other NCBI databases. from Bio import Entrez, Entrez.email = 'researcher@university.edu' (required by NCBI). Entrez.esearch(db='nucleotide', term='BRCA1[gene] AND human[orgn]') searches for records. Entrez.efetch(db='nucleotide', id='NM_007294', rettype='gb', retmode='text') retrieves a specific record. Always include your email address (NCBI policy), and add time.sleep(0.34) between requests to stay within the 3-requests-per-second limit for unauthenticated users. For high-volume access, register for an NCBI API key which allows 10 requests per second.

BLAST searches find similar sequences in databases. from Bio.Blast import NCBIWWW, NCBIXML. NCBIWWW.qblast('blastn', 'nt', sequence) runs a nucleotide BLAST against the nt database. Parse results: blast_records = NCBIXML.parse(result_handle). For each hit: hit.title (description), hit.accession (database ID), hsp.score (alignment score), hsp.expect (E-value), hsp.identities (number of identical positions), hsp.query (aligned query sequence), hsp.sbjct (aligned subject sequence). Filter hits by E-value: significant_hits = [h for h in blast_record.alignments if h.hsps[0].expect < 1e-10].

For local BLAST searches (faster, no network dependency), install BLAST+ command-line tools and call them from Python. from Bio.Blast.Applications import NcbiblastnCommandline. Create a local database: makeblastdb -in reference.fasta -dbtype nucl. Run searches: NcbiblastnCommandline(query='query.fasta', db='reference', outfmt=6, out='results.txt')(). Format 6 (tabular) is the easiest to parse with pandas: results = pd.read_csv('results.txt', sep='\t', header=None, names=['query', 'subject', 'identity', 'length', 'mismatches', 'gaps', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']).

UniProt, Ensembl, and other databases provide REST APIs that Python's requests library accesses directly. requests.get('https://rest.uniprot.org/uniprotkb/P53_HUMAN.json').json() retrieves a protein record. requests.get('https://rest.ensembl.org/lookup/id/ENSG00000141510', headers={'Content-Type': 'application/json'}).json() retrieves gene information. These APIs return structured JSON that converts directly to Python dictionaries, avoiding the need for specialized parsers. For bulk downloads, most databases provide FTP access to complete datasets.

Step 3: Align and Compare Sequences

Pairwise alignment compares two sequences to find the best matching arrangement. from Bio import pairwise2. alignments = pairwise2.align.globalxx(seq1, seq2) for global alignment (align entire sequences) or pairwise2.align.localxx(seq1, seq2) for local alignment (find the best matching subsequence). The xx suffix specifies scoring: globalms uses match/mismatch scores, globalds uses a substitution matrix plus gap penalties. For protein alignments, use the BLOSUM62 matrix: from Bio.Align import substitution_matrices, matrix = substitution_matrices.load('BLOSUM62'), then pairwise2.align.globalds(seq1, seq2, matrix, -10, -0.5) with gap open and extension penalties.

Multiple sequence alignment (MSA) aligns three or more related sequences to identify conserved regions. Biopython interfaces with external MSA tools. from Bio.Align.Applications import ClustalOmegaCommandline. ClustalOmegaCommandline(infile='sequences.fasta', outfile='aligned.fasta', auto=True)() runs Clustal Omega. MUSCLE and MAFFT are faster alternatives for large datasets. Parse the alignment: from Bio import AlignIO, alignment = AlignIO.read('aligned.fasta', 'fasta'). Access columns: alignment[:, 10] gives position 10 across all sequences. Compute conservation: count the most common character at each position to find conserved residues.

Alignment visualization and analysis follows naturally. Compute percent identity between aligned sequences by counting matching positions divided by alignment length. Identify conserved blocks where all sequences agree. Compute a position-specific scoring matrix (PSSM) or sequence logo that shows the information content at each position. Plot conservation profiles with matplotlib: conservation scores along the sequence length, with high peaks indicating functionally important positions. Highlight these conserved regions on protein structures to connect sequence conservation to structural and functional roles.

For genomic-scale alignments (whole chromosomes, thousands of sequences), specialized tools replace Biopython's built-in methods. minimap2 (called from Python via subprocess or pysam) aligns reads to reference genomes at high speed. STAR aligns RNA-seq reads. BWA-MEM2 handles short-read DNA alignment. These tools produce SAM/BAM files that pysam parses in Python: import pysam, samfile = pysam.AlignmentFile('aligned.bam', 'rb'), then iterate over aligned reads with their mapping positions, quality scores, and CIGAR strings describing the alignment.

Step 4: Analyze Genomic Data

Gene expression analysis typically starts with a count matrix: genes as rows, samples as columns, values as read counts from RNA-seq. Load with pandas: counts = pd.read_csv('counts.csv', index_col=0). Normalize counts to account for library size differences: counts_per_million = counts.div(counts.sum()) * 1e6. Log-transform for visualization: log_cpm = np.log2(counts_per_million + 1). Filter lowly expressed genes: expressed = counts.loc[counts.mean(axis=1) > 10]. These preprocessing steps prepare the data for differential expression analysis and clustering.

Differential expression identifies genes whose activity differs between conditions. The pyDESeq2 library (pip install pydeseq2) implements the DESeq2 statistical framework in Python. from pydeseq2.dds import DeseqDataSet, from pydeseq2.ds import DeseqStats. Create the dataset: dds = DeseqDataSet(counts=count_matrix, metadata=sample_info, design_factors='condition'). Run the analysis: dds.deseq2(), stat_res = DeseqStats(dds, contrast=['condition', 'treatment', 'control']), stat_res.summary(). The results include log2 fold changes, p-values, and adjusted p-values (FDR-corrected) for each gene.

Volcano plots are the standard visualization for differential expression results. Plot negative log10 p-value on the y-axis vs log2 fold change on the x-axis. Each point is a gene. Significantly upregulated genes appear in the upper-right, downregulated in the upper-left, unchanged genes cluster near the center. Color significant genes differently: results['color'] = np.where((results['padj'] < 0.05) & (abs(results['log2FoldChange']) > 1), 'red', 'gray'). Label the most significant genes with ax.annotate().

Clustering and heatmaps reveal expression patterns across genes and samples. from scipy.cluster.hierarchy import linkage, dendrogram. Hierarchically cluster genes with similar expression profiles: Z = linkage(expression_matrix, method='ward'). sns.clustermap(expression_matrix, method='ward', cmap='RdBu_r', z_score=0) creates a heatmap with dendrograms on both axes, grouping genes with similar patterns and samples with similar profiles. This visualization reveals gene modules (groups of co-regulated genes) and sample subtypes that may correspond to biological states or experimental conditions.

Step 5: Build Phylogenetic Trees

Phylogenetic trees represent evolutionary relationships between sequences. Start with a multiple sequence alignment, then compute a distance matrix or use a model-based method to infer the tree. from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor. calculator = DistanceCalculator('identity'). dm = calculator.get_distance(alignment). constructor = DistanceTreeConstructor(). tree = constructor.nj(dm) builds a neighbor-joining tree, or tree = constructor.upgma(dm) builds a UPGMA tree. Neighbor-joining is generally preferred because it does not assume a constant rate of evolution.

For more accurate trees, use maximum likelihood methods via external tools. RAxML, IQ-TREE, and PhyML are the standard tools, callable from Python via subprocess. IQ-TREE automatically selects the best substitution model: subprocess.run(['iqtree2', '-s', 'aligned.fasta', '-m', 'TEST', '-bb', '1000']), which tests models, builds the tree, and computes bootstrap support values (1000 replicates). Parse the resulting Newick tree file: from Bio import Phylo, tree = Phylo.read('aligned.fasta.treefile', 'newick').

Tree visualization uses Bio.Phylo or the ete3 library. Phylo.draw(tree) creates a basic tree plot. For publication-quality trees, ete3 provides extensive customization: from ete3 import Tree, TreeStyle, NodeStyle. t = Tree('tree.nwk'). Style nodes: for node in t.traverse(): ns = NodeStyle(), ns['size'] = 0, node.set_style(ns). Style leaves: for leaf in t.get_leaves(): add colors, labels, or data annotations. Render: t.render('tree.pdf', tree_style=ts). ete3 handles circular layouts, fan layouts, custom color schemes, and annotations that show bootstrap values, branch lengths, or associated data alongside the tree.

Interpreting trees requires understanding what they do and do not show. Branch lengths represent evolutionary distance (substitutions per site), not time, unless a molecular clock is assumed. Internal nodes represent inferred ancestral sequences, not observed organisms. Bootstrap values on branches indicate statistical support: values above 70 to 80% are generally considered reliable. Trees are hypotheses about evolutionary history, not established facts, and different methods or data can produce different topologies. Always report the alignment method, substitution model, tree-building method, and support values so that readers can evaluate the reliability of your phylogenetic conclusions.

Key Takeaway

Biopython handles biological file formats and database access, while pandas, NumPy, and matplotlib handle the statistical analysis and visualization. The combination covers the full bioinformatics workflow from raw sequences to publication figures.