GPU Computing for Deep Learning: Why GPUs Matter and How They Work

Updated May 2026
Graphics processing units (GPUs) are the hardware that makes deep learning practical. A modern GPU performs matrix multiplications 10 to 100 times faster than a CPU by running thousands of simple operations simultaneously. Deep learning is fundamentally a series of matrix multiplications, which is why GPUs transformed AI from an academic curiosity into a technology powering billions of daily interactions. Understanding how GPUs work, how to choose them, and how to use them efficiently is essential knowledge for anyone training neural networks.

Why GPUs, Not CPUs

A CPU is designed for versatility. It has a small number of powerful cores (typically 8 to 64) that excel at complex, branching computation with many conditional decisions. Each core handles different instructions independently, optimized for low latency on individual tasks. A CPU can run an operating system, a web server, a database, and a video encoder simultaneously because each core is a general-purpose computing engine.

A GPU is designed for throughput. It has thousands of simple cores (NVIDIA's H100 has 16,896 CUDA cores in its SXM form) that perform the same operation on many pieces of data simultaneously. This design, a form of SIMD (Single Instruction, Multiple Data) that NVIDIA calls SIMT (Single Instruction, Multiple Threads), is well suited to the matrix multiplications that dominate neural network computation. Multiplying a 1024x1024 matrix by another 1024x1024 matrix requires about 1 billion multiply-add operations (roughly 2 billion floating-point operations), and each of the roughly 1 million output elements can be computed independently of the others. A GPU distributes this work across its thousands of cores and completes it in parallel.
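The operation count for the 1024x1024 example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names are illustrative and not taken from any GPU library. It counts one multiply-add per (row, column, inner index) triple, and counts a multiply-add as two floating-point operations, which is where the figure of about 2 billion comes from.

```python
def matmul_op_counts(m, k, n):
    """Count the work in multiplying an (m x k) matrix by a (k x n) matrix."""
    multiply_adds = m * k * n      # one multiply-add per (row, column, inner index)
    flops = 2 * multiply_adds      # a multiply and an add counted separately
    independent_outputs = m * n    # each output element is its own dot product
    return multiply_adds, flops, independent_outputs


def output_element(a, b, i, j):
    """Compute one output element; it reads only row i of a and column j of b."""
    return sum(a[i][p] * b[p][j] for p in range(len(b)))


macs, flops, outputs = matmul_op_counts(1024, 1024, 1024)
print(f"{macs:,} multiply-adds, {flops:,} FLOPs, {outputs:,} independent outputs")
```

Because `output_element` touches no other output, a GPU is free to assign each output element (or each tile of them) to a different group of threads.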

The speedup is dramatic. Training a ResNet-50 image classification model on the ImageNet dataset takes about 29 hours on a single modern GPU (NVIDIA A100). The same training on a high-end CPU would take roughly 2 to 3 weeks. For larger models, the gap is even wider: training GPT-3 on CPUs alone would have taken decades. GPUs reduced the training time for large language models from "impossible within a human career" to "months on a cluster of thousands of GPUs."

The key metric is FLOPS: floating-point operations per second. An NVIDIA H100 GPU achieves approximately 989 teraFLOPS of dense half-precision (FP16) throughput on its tensor cores in the SXM form, and about 756 teraFLOPS in the PCIe form. A high-end server CPU achieves roughly 2 to 4 teraFLOPS. This gap of 200x or more in peak throughput is why every serious deep learning workload runs on GPUs. The actual realized speedup varies by workload but typically falls in the 10x to 100x range after accounting for memory transfers, overhead, and the degree to which the workload can be parallelized.

GPU Architecture for Deep Learning

CUDA Cores and Tensor Cores

Standard CUDA cores perform general-purpose parallel computation: floating-point multiplication and addition on individual numbers. They are versatile and handle any computation that can be parallelized. Tensor cores, introduced with NVIDIA's Volta architecture in 2017, are specialized for the specific operation deep learning needs most: multiplying two small matrices and adding the result to a third (the matrix multiply-accumulate operation, D = A * B + C, where A, B, C, and D are typically 4x4 or 8x8 matrices).
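The D = A * B + C operation can be illustrated in plain Python. This is only a sketch of the arithmetic, not of how the hardware executes it: `tile_mma` and `tiled_matmul` are hypothetical names, the 4x4 tile size is chosen to match the text, and real tensor cores are reached through libraries such as cuBLAS rather than written by hand like this.

```python
def tile_mma(a, b, c):
    """Return D = A * B + C for small square tiles given as nested lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]


def tiled_matmul(a, b, tile=4):
    """Multiply square matrices by chaining tile-sized multiply-accumulates."""
    n = len(a)
    assert n % tile == 0, "this sketch assumes the size is a multiple of the tile"
    d = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, tile):
        for bj in range(0, n, tile):
            acc = [[0.0] * tile for _ in range(tile)]  # running C/D tile
            for bk in range(0, n, tile):
                a_tile = [row[bk:bk + tile] for row in a[bi:bi + tile]]
                b_tile = [row[bj:bj + tile] for row in b[bk:bk + tile]]
                acc = tile_mma(a_tile, b_tile, acc)    # D becomes the next C
            for i in range(tile):
                d[bi + i][bj:bj + tile] = acc[i]
    return d
```

The accumulator tile plays the role of C on the way in and D on the way out, which is why a large matrix product decomposes naturally into many small multiply-accumulate steps.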

Tensor cores operate on reduced-precision data types. The most common is FP16 (16-bit floating point), which uses half the memory and bandwidth of FP32 (32-bit) while providing sufficient precision for neural network training. Newer tensor cores support BF16 (brain floating-point 16, which has the same exponent range as FP32 but reduced mantissa precision), FP8 (8-bit floating point, used for both training and inference on recent hardware), and INT8 (8-bit integer for quantized inference). Mixed-precision training uses FP16 or BF16 for most operations while maintaining an FP32 master copy of the weights for the weight update step, because small updates would otherwise be rounded away in 16-bit arithmetic. This captures most of the speed benefit of reduced precision without the accuracy loss.
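The reason for keeping an FP32 copy of the weights can be seen with a small experiment. The sketch below uses Python's standard `struct` module to round values to IEEE half and single precision; the starting weight of 1.0, the update of 1e-4, and the 1000 steps are illustrative values chosen for the demonstration, not figures from any training recipe.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def apply_updates(weight, update, steps, rounder):
    """Add the same small update repeatedly, rounding after every step."""
    for _ in range(steps):
        weight = rounder(weight + update)
    return weight


# Near 1.0 the spacing between FP16 values is about 0.001, so an update of
# 0.0001 is rounded away every time when the weight itself is stored in FP16.
fp16_only = apply_updates(1.0, 1e-4, 1000, to_fp16)

# An FP32 master copy accumulates the updates; an FP16 view is derived from it.
fp32_master = apply_updates(1.0, 1e-4, 1000, to_fp32)
fp16_view = to_fp16(fp32_master)

print(fp16_only, fp32_master, fp16_view)
```

The FP16-only weight never moves, while the FP32 master copy ends close to 1.1 and its FP16 view reflects the accumulated change.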

Memory Hierarchy

GPU memory is organized in a hierarchy that critically affects deep learning performance. Global memory is the large pool available to all cores: HBM (High Bandwidth Memory) on data-center GPUs and GDDR on consumer cards, typically 24 to 80 GB on modern cards. HBM3 on the H100 provides 3.35 terabytes per second of bandwidth. Shared memory (SRAM) is a much smaller but much faster pool (up to about 228 KB per streaming multiprocessor on H100) that groups of threads can use for temporary data. Registers are the fastest storage, private to each thread.

Memory bandwidth, not compute, is often the bottleneck in deep learning. The GPU can perform computations faster than it can read data from global memory. Operations with low arithmetic intensity (few operations per byte of data read) are memory-bound, meaning the GPU's compute cores sit idle waiting for data. Optimized implementations like FlashAttention restructure computations to maximize data reuse in shared memory, dramatically reducing the number of global memory accesses and achieving 2 to 4x speedups on memory-bound operations.
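Whether an operation is memory-bound can be estimated with the roofline model: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below is a rough estimate under stated assumptions. The bandwidth is the 3.35 TB/s H100 figure quoted earlier, the peak of 1e15 FLOP/s is a round illustrative number rather than a published specification, and the intensity formulas assume FP16 values and ideal data reuse.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline estimate: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)


def elementwise_add_intensity(bytes_per_value=2):
    """FLOPs per byte for c = a + b: one add, two values read, one written."""
    return 1 / (3 * bytes_per_value)


def matmul_intensity(n, bytes_per_value=2):
    """FLOPs per byte for an n x n matmul if each matrix crosses memory once."""
    return (2 * n**3) / (3 * n**2 * bytes_per_value)


PEAK = 1.0e15        # FLOP/s, illustrative round number (assumption)
BANDWIDTH = 3.35e12  # bytes/s, the H100 HBM3 figure quoted in the text

for name, intensity in [
    ("elementwise add", elementwise_add_intensity()),
    ("1024x1024 matmul", matmul_intensity(1024)),
]:
    rate = attainable_flops(intensity, PEAK, BANDWIDTH)
    bound = "compute-bound" if rate == PEAK else "memory-bound"
    print(f"{name}: {intensity:.2f} FLOP/byte, {rate / 1e12:.2f} TFLOPS, {bound}")
```

Under these assumptions the elementwise add reaches well under 1% of peak, which is why fusing such operations and reusing data in shared memory pays off.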

GPU Memory Management

Running out of GPU memory is the most common practical problem in deep learning. A single GPU has a fixed amount of memory (24 GB for an RTX 4090, 40 or 80 GB for an A100, 80 GB for an H100), and during training this must hold the model parameters, optimizer states, gradient buffers, activations from the forward pass (needed for backpropagation), and the current batch of data. For large models, these requirements easily exceed available memory.

Model parameters consume memory proportional to the number of parameters times the bytes per parameter. A 7-billion-parameter model in FP16 requires 14 GB just for the weights. Optimizer states for Adam require two additional copies of the parameters (momentum and variance estimates), adding another 28 GB in FP16 or 56 GB in FP32. Gradients require another copy. Activations, the intermediate outputs of each layer that must be stored for backpropagation, can require more memory than the model itself for large batch sizes and deep networks.
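The arithmetic above can be written as a small calculator. This is a back-of-the-envelope sketch: the function name and its parameters are illustrative, 1 GB is taken as 10^9 bytes, and activations are deliberately left out because they depend on batch size, sequence length, and architecture.

```python
def training_memory_gb(num_params, weight_bytes=2, grad_bytes=2, adam_state_bytes=4):
    """Estimate memory for weights, gradients, and Adam states (no activations)."""
    gb = 1e9
    weights = num_params * weight_bytes / gb
    gradients = num_params * grad_bytes / gb
    adam_states = 2 * num_params * adam_state_bytes / gb  # momentum + variance
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_states": adam_states,
        "total": weights + gradients + adam_states,
    }


estimate = training_memory_gb(7_000_000_000)
for name, value in estimate.items():
    print(f"{name}: {value:.0f} GB")
```

With FP16 weights and gradients and FP32 Adam states, the 7-billion-parameter example already exceeds an 80 GB card before any activations are stored.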

Memory reduction techniques are essential for training large models. Gradient checkpointing trades computation for memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass, reducing activation memory from O(N) to O(sqrt(N)) in the number of layers N at the cost of roughly 33% more computation. Mixed-precision training roughly halves the memory for activations, and for parameters as well when no FP32 master copy is kept. Gradient accumulation simulates larger batch sizes without proportionally increasing memory by accumulating gradients over several small forward and backward passes before updating weights.
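Gradient accumulation works because the gradient of a mean loss over a batch equals the average of the gradients over equal-sized micro-batches. The sketch below checks this for a one-parameter model y = w * x with mean squared error; the model, data, and function names are illustrative assumptions chosen to keep the example self-contained.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_gradient(w, xs, ys, micro_batch):
    """Average per-micro-batch gradients, as gradient accumulation does."""
    assert len(xs) % micro_batch == 0, "sketch assumes equal-sized micro-batches"
    steps = len(xs) // micro_batch
    total = 0.0
    for s in range(steps):
        lo, hi = s * micro_batch, (s + 1) * micro_batch
        # Only one micro-batch is processed at a time, so peak activation
        # memory matches the micro-batch size rather than the full batch.
        total += mse_gradient(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [float(x) for x in range(1, 9)]
ys = [3.0 * x for x in xs]
full = mse_gradient(0.5, xs, ys)
accumulated = accumulated_gradient(0.5, xs, ys, micro_batch=2)
print(full, accumulated)
```

Dividing each micro-batch gradient by the number of accumulation steps is the detail that is easiest to forget; without it the effective learning rate grows with the number of steps.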

Multi-GPU and Distributed Training

When a model or dataset is too large for a single GPU, training must be distributed across multiple GPUs. Data parallelism is the simplest approach: each GPU holds a complete copy of the model and processes a different subset of the training data. After each batch, the GPUs synchronize their gradients (typically with an all-reduce operation, which sums the gradients across all GPUs so that they can be averaged), and each GPU applies the same weight update. This scales well up to 8 to 16 GPUs on a single machine and beyond with fast inter-node networking.
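One data-parallel step can be simulated on the CPU to show that the replicas stay identical. This is a toy sketch: the "GPUs" are Python lists, `all_reduce_mean` is a hypothetical stand-in for a collective such as the all-reduce provided by communication libraries, and the model is a two-parameter linear fit.

```python
def all_reduce_mean(per_gpu_grads):
    """Return the element-wise mean; every replica receives the same result."""
    world = len(per_gpu_grads)
    mean = [sum(vals) / world for vals in zip(*per_gpu_grads)]
    return [list(mean) for _ in range(world)]


def local_gradient(weights, shard):
    """Gradient of mean squared error for y = w0 + w1 * x on one data shard."""
    g0 = g1 = 0.0
    for x, y in shard:
        err = weights[0] + weights[1] * x - y
        g0 += 2 * err / len(shard)
        g1 += 2 * err * x / len(shard)
    return [g0, g1]


def data_parallel_step(replicas, shards, lr=0.01):
    """Each replica computes a local gradient, then all apply the averaged one."""
    grads = [local_gradient(w, s) for w, s in zip(replicas, shards)]
    synced = all_reduce_mean(grads)
    return [[w - lr * g for w, g in zip(ws, gs)] for ws, gs in zip(replicas, synced)]


data = [(float(x), 2.0 * x + 1.0) for x in range(8)]
shards = [data[:4], data[4:]]           # each simulated GPU sees half the batch
replicas = [[0.0, 0.0], [0.0, 0.0]]     # identical starting weights
replicas = data_parallel_step(replicas, shards)
print(replicas)
```

With equal-sized shards, the averaged gradient matches the gradient a single device would compute on the whole batch, so the update is the same as single-GPU training with a larger batch.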

Model parallelism splits the model itself across GPUs, with different layers or different parts of each layer on different devices. Tensor parallelism splits individual layers (like the attention heads in a transformer) across GPUs. Pipeline parallelism assigns different groups of layers to different GPUs and processes micro-batches in a pipeline, so that all GPUs stay busy for most of each step instead of waiting for one another. The Megatron-LM and DeepSpeed libraries provide implementations of these parallelism strategies that handle the complex communication patterns involved.
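How busy the GPUs stay in a pipeline depends on the number of micro-batches. The sketch below builds a simplified schedule, assuming a forward pass only, equal time per stage, and a GPipe-style ordering in which stage s starts micro-batch i at time step s + i. These are simplifying assumptions, and real schedulers interleave backward passes as well. The idle share in this model is (p - 1) / (m + p - 1) for p stages and m micro-batches.

```python
def pipeline_forward_schedule(num_stages, num_micro_batches):
    """Return a grid where grid[stage][t] is a micro-batch index or None (idle)."""
    steps = num_stages + num_micro_batches - 1
    grid = [[None] * steps for _ in range(num_stages)]
    for stage in range(num_stages):
        for mb in range(num_micro_batches):
            # A stage can start micro-batch mb only after the previous stage
            # has finished it, which delays each stage by one time step.
            grid[stage][stage + mb] = mb
    return grid


def idle_fraction(grid):
    """Share of (stage, time step) slots in which a GPU has nothing to do."""
    slots = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return idle / slots


for m in (1, 4, 32):
    grid = pipeline_forward_schedule(num_stages=4, num_micro_batches=m)
    print(f"{m} micro-batches: {idle_fraction(grid):.0%} idle")
```

With a single micro-batch the pipeline degenerates to one active GPU at a time; splitting the batch into many micro-batches is what shrinks the idle share.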

ZeRO (Zero Redundancy Optimizer) from Microsoft's DeepSpeed eliminates the redundant storage of optimizer states, gradients, and parameters across data-parallel GPUs. In standard data parallelism, each GPU stores a complete copy of everything. ZeRO partitions these across GPUs and gathers them on demand, in three stages: partitioning the optimizer states alone reduces memory per GPU by up to 4x, partitioning the gradients as well raises that to up to 8x, and partitioning the parameters too makes the reduction scale with the number of GPUs. This allows training much larger models on the same hardware, at the cost of additional communication that is largely hidden behind computation.
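The per-GPU savings can be estimated with simple accounting. The sketch below assumes mixed-precision Adam with 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for optimizer state (FP32 weights, momentum, and variance), which is the accounting commonly used to describe ZeRO. The function name, the 7.5-billion-parameter model, and the 64 GPUs are illustrative, and activations and temporary buffers are ignored.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Estimate per-GPU model-state memory for ZeRO stages 0 (off) through 3."""
    param_bytes, grad_bytes, optim_bytes = 2.0, 2.0, 12.0
    if stage >= 1:
        optim_bytes /= num_gpus   # stage 1: partition optimizer states
    if stage >= 2:
        grad_bytes /= num_gpus    # stage 2: also partition gradients
    if stage >= 3:
        param_bytes /= num_gpus   # stage 3: also partition parameters
    return num_params * (param_bytes + grad_bytes + optim_bytes) / 1e9


for stage in range(4):
    gb = zero_memory_gb(7_500_000_000, num_gpus=64, stage=stage)
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

The first two stages approach fixed limits (16 bytes per parameter shrinking toward 4 and then 2), while the third keeps shrinking as GPUs are added.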

Cloud GPU Options

Cloud computing has made GPU access available without the capital expenditure of buying hardware. AWS offers NVIDIA A100 and H100 instances (p4d.24xlarge and p5.48xlarge, respectively), with prices ranging from $10 to $100+ per hour depending on the GPU count and instance type. Google Cloud provides both NVIDIA GPUs and its own TPUs (Tensor Processing Units), which offer competitive performance for transformer workloads at lower cost in some configurations. Azure, Lambda Labs, CoreWeave, and smaller providers offer various GPU instances at competitive prices.

Spot/preemptible instances offer 60 to 90% discounts in exchange for the possibility of being interrupted when demand is high. For training runs that use checkpointing (saving model state periodically), spot instances are highly cost-effective: if interrupted, you lose at most the progress since the last checkpoint, typically 10 to 30 minutes. The total cost of a training run on spot instances is typically 2 to 5 times lower than on-demand pricing.
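The trade-off between the discount and the work lost to interruptions can be estimated with a short calculation. Every number in the sketch below is hypothetical (the hourly price, the discount, the interruption count, and the restart overhead are not quotes from any provider), and the model assumes that an interruption loses half a checkpoint interval of work on average.

```python
def spot_run_cost(compute_hours, on_demand_price, discount,
                  interruptions, checkpoint_interval_h, restart_overhead_h):
    """Estimate the cost of a checkpointed training run on spot instances."""
    # Work redone after each interruption: on average half a checkpoint
    # interval, plus the time to restart the job and reload the checkpoint.
    lost_hours = interruptions * (checkpoint_interval_h / 2 + restart_overhead_h)
    billed_hours = compute_hours + lost_hours
    return billed_hours * on_demand_price * (1 - discount)


# Hypothetical run: 100 hours of compute at $32/hour on demand, a 70% spot
# discount, 10 interruptions, checkpoints every 30 minutes, 15-minute restarts.
on_demand_cost = 100 * 32.0
spot_cost = spot_run_cost(100, 32.0, 0.70, 10, 0.5, 0.25)
print(f"on-demand ${on_demand_cost:.0f}, spot ${spot_cost:.0f}, "
      f"ratio {on_demand_cost / spot_cost:.1f}x")
```

In this model the lost work adds only a few percent to the billed hours, so the saving is dominated by the discount as long as checkpoints are frequent.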

For individuals and small teams, consumer GPUs like the NVIDIA RTX 4090 (24 GB, roughly $1,600) provide excellent price-performance for models that fit in 24 GB of memory. For fine-tuning pre-trained language models, a single RTX 4090 can handle models up to about 13 billion parameters, but only with parameter-efficient methods such as LoRA on quantized weights in addition to gradient checkpointing and mixed precision; the FP16 weights of a 13-billion-parameter model alone occupy 26 GB. For full fine-tuning of larger models or for pre-training runs, cloud GPUs or multi-GPU setups become necessary.

Key Takeaway

GPUs accelerate deep learning by 10 to 100 times over CPUs through massive parallelism optimized for matrix operations. Memory management is the key practical challenge, addressed through mixed precision, gradient checkpointing, and distributed training. Cloud GPUs make this hardware accessible without large upfront investment, and spot instances dramatically reduce costs for fault-tolerant workloads.