Parallel Computing in Python
Python's Global Interpreter Lock (GIL) prevents multiple threads from executing Python bytecode simultaneously, which means threading does not speed up CPU-bound code. This is a frequent source of confusion: adding threads to a computational loop makes it slower, not faster, because of the GIL overhead. Multiprocessing (separate Python processes, each with its own GIL) is the solution for CPU-bound work. Threading works for I/O-bound work because threads release the GIL during I/O operations (network calls, file reads, database queries). NumPy, SciPy, and other C-extension libraries also release the GIL during their internal computations, so multi-threaded NumPy operations do benefit from threading.
Step 1: Identify the Bottleneck
Profile before parallelizing. %timeit expression in Jupyter measures execution time. import cProfile, cProfile.run('my_function()') identifies which functions consume the most time. line_profiler (pip install line_profiler) shows time per line within a function: decorate with @profile, then run kernprof -l -v script.py. memory_profiler (pip install memory-profiler) shows memory usage per line. These tools reveal whether your code is CPU-bound (computation dominates), I/O-bound (waiting for files/network dominates), or memory-bound (swapping to disk dominates).
CPU-bound code spends most of its time in computation: numerical loops, matrix operations, simulation steps, image processing. Signs: CPU usage is near 100% on one core while others are idle. Solution: multiprocessing (distribute work across cores) or Numba/Cython (compile the hot loop to machine code). I/O-bound code spends most of its time waiting: downloading files, querying APIs, reading from disk, writing results. Signs: CPU usage is low even during execution. Solution: threading or asyncio (overlap multiple I/O operations).
Before adding complexity with parallelism, check whether vectorization alone solves the problem. Replacing a Python for-loop with a NumPy vectorized operation often provides a 100x speedup on a single core, which may be sufficient. NumPy, SciPy, and pandas internally use multi-threaded BLAS libraries for matrix operations, meaning np.dot(A, B) for large matrices already uses all CPU cores without any explicit parallelization code. Check whether your NumPy installation uses a multi-threaded BLAS: np.show_config() shows the linked BLAS library (OpenBLAS and MKL are both multi-threaded by default).
Step 2: Use Multiprocessing for CPU Work
concurrent.futures.ProcessPoolExecutor provides the simplest interface for parallel CPU work. from concurrent.futures import ProcessPoolExecutor. with ProcessPoolExecutor(max_workers=4) as executor: results = list(executor.map(process_item, items)). This distributes items across 4 worker processes and collects results in order. Each worker runs in a separate Python process with its own GIL, achieving true parallelism on multi-core CPUs. max_workers defaults to the number of CPU cores (os.cpu_count()), which is usually optimal for CPU-bound work.
The executor.submit() method provides more control than map(). future = executor.submit(process_item, item) starts a single task and returns a Future object. future.result() blocks until the result is available. future.done() checks completion without blocking. as_completed(futures) yields futures as they complete, enabling processing of results in completion order rather than submission order. This pattern is useful when tasks have varying execution times and you want to process results as soon as they are available.
Data serialization limits what can be passed between processes. Arguments and return values are serialized with pickle, which handles most Python objects but fails on lambda functions, open file handles, and database connections. Large data should not be passed as arguments because serialization overhead dominates for arrays larger than about 10 MB. Instead, have each worker load its own data (pass file paths rather than data), or use shared memory: from multiprocessing import shared_memory, shm = shared_memory.SharedMemory(create=True, size=array.nbytes) creates memory accessible to all processes without copying.
Pool-based patterns handle common parallel workloads. For embarrassingly parallel tasks (processing independent files), map() is sufficient. For reduction tasks (compute partial results, then combine), use map() to compute partials, then reduce in the main process: partial_sums = list(executor.map(compute_partial, chunks)), total = sum(partial_sums). For producer-consumer patterns (one process generates work, others consume it), use a multiprocessing.Queue to pass items between processes.
Step 3: Use Threading for I/O Work
concurrent.futures.ThreadPoolExecutor has the same interface as ProcessPoolExecutor but uses threads instead of processes. from concurrent.futures import ThreadPoolExecutor. with ThreadPoolExecutor(max_workers=20) as executor: results = list(executor.map(download_file, urls)). Twenty threads can download 20 files simultaneously because each thread releases the GIL while waiting for network I/O. The speedup is dramatic: downloading 100 files that each take 1 second sequentially takes 100 seconds but only about 5 seconds with 20 threads.
Thread counts for I/O work can be much higher than CPU core counts because threads spend most of their time waiting, not computing. For network requests, 20 to 50 threads is typical. For disk I/O, 4 to 8 threads is usually optimal (limited by disk throughput). For database queries, match the database connection pool size. Too many threads waste memory and context-switching overhead. Too few underutilize the I/O bandwidth. Start with 20 for network tasks and adjust based on measured throughput.
asyncio provides an alternative to threading for I/O-bound work using cooperative multitasking. import asyncio, import aiohttp. async def fetch(session, url): async with session.get(url) as response: return await response.text(). async def main(): async with aiohttp.ClientSession() as session: tasks = [fetch(session, url) for url in urls], results = await asyncio.gather(*tasks). asyncio.run(main()). asyncio uses a single thread with an event loop, avoiding the memory overhead of many threads. It is preferred for applications making thousands of concurrent network requests (web scraping, API clients, microservices).
Step 4: Scale with Dask
Dask parallelizes pandas and NumPy operations on datasets larger than memory. import dask.dataframe as dd. ddf = dd.read_csv('huge_dataset_*.csv') creates a Dask DataFrame that lazily references many CSV files. ddf.groupby('category').mean().compute() reads, groups, and averages across all files using all CPU cores, even if the total data is larger than RAM. Dask breaks the computation into chunks, processes each chunk independently, and combines the results. The .compute() call triggers actual execution; without it, Dask only builds the computation graph.
Dask arrays parallelize NumPy for large arrays. import dask.array as da. x = da.from_array(large_array, chunks=(1000, 1000)) wraps a NumPy array in chunk-aware Dask. da.random.random((100000, 100000), chunks=(1000, 1000)) creates a 10-billion-element random array without allocating all the memory at once. Operations like x.mean(), x.T @ x, and da.linalg.svd(x) compute chunk-by-chunk. Dask arrays support most NumPy operations with the same syntax, making the transition from NumPy to Dask straightforward for existing code.
Dask.delayed wraps arbitrary Python functions for parallel execution. from dask import delayed. @delayed def process(file): return pd.read_csv(file).describe(). results = [process(f) for f in files]. combined = delayed(pd.concat)(results). combined.compute() runs all processing in parallel. Dask builds a task graph showing dependencies between operations, then executes the graph using multiple threads or processes. Visualize the graph: combined.visualize() produces a diagram showing how tasks depend on each other.
Dask distributed scales to multiple machines. from dask.distributed import Client. client = Client('scheduler_address:8786') connects to a Dask cluster. client = Client() starts a local cluster using all cores. The distributed scheduler handles task assignment, data transfer between workers, fault recovery, and load balancing. Dask clusters run on laptops, HPC clusters (dask-jobqueue), Kubernetes (dask-kubernetes), and cloud platforms (Coiled, Saturn Cloud). The same code that runs locally with Client() runs on a 100-node cluster by changing only the scheduler address.
Step 5: Accelerate with Numba and GPU
Numba compiles Python functions to machine code using LLVM, achieving C-like speed for numerical loops. from numba import njit. @njit def monte_carlo_pi(n): count = 0; for i in range(n): x, y = np.random.random(), np.random.random(); if x*x + y*y < 1: count += 1; return 4 * count / n. The first call compiles the function (takes about 1 second), subsequent calls run at compiled speed (100x faster than pure Python). Numba works best on numerical code with loops over NumPy arrays. It does not support pandas, dictionaries, or most Python standard library functions.
Numba parallel mode distributes loop iterations across CPU cores. @njit(parallel=True) def compute(data): result = np.zeros(len(data)); for i in prange(len(data)): result[i] = expensive_computation(data[i]); return result. prange (parallel range) distributes iterations across cores. This combines the speed of compiled code with the parallelism of multiple cores, achieving speedups of 100x to 1000x over pure Python loops for embarrassingly parallel numerical computations.
GPU computing with CuPy provides NumPy-compatible array operations on NVIDIA GPUs. import cupy as cp. x_gpu = cp.array(x_cpu) transfers a NumPy array to the GPU. result_gpu = cp.dot(A_gpu, B_gpu) computes a matrix multiplication on the GPU. result_cpu = result_gpu.get() transfers the result back to CPU memory. For large matrix operations (linear algebra, FFT, convolution), GPUs provide 10x to 100x speedup over CPU. The key is minimizing data transfer between CPU and GPU memory, which is the bottleneck: keep data on the GPU for as many operations as possible before transferring results back.
Choosing the right parallelization strategy depends on the problem. For independent file processing: ProcessPoolExecutor. For network I/O: ThreadPoolExecutor or asyncio. For large NumPy/pandas operations: Dask. For tight numerical loops: Numba. For massive matrix operations: CuPy/GPU. For most scientific Python code, the biggest speedups come not from parallelism but from vectorization: replacing Python loops with NumPy operations. Parallelize only after profiling confirms that the bottleneck is a correctly vectorized operation that simply needs more compute power.
Profile first, vectorize second, parallelize third. Most scientific Python code becomes fast enough with proper NumPy vectorization. Parallelism adds complexity that is only justified after profiling confirms a genuine bottleneck.