Hardware for Deep Learning: GPUs, TPUs, and Custom AI Accelerators

Updated May 2026
The hardware running deep learning systems determines what is possible. NVIDIA GPUs dominate both training and inference, with their A100 and H100 chips powering the vast majority of AI workloads. Google's TPUs offer competitive performance on Google Cloud. A growing ecosystem of custom AI accelerators targets specific use cases from datacenter inference to smartphone AI. Choosing the right hardware involves balancing computational throughput, memory capacity, interconnect bandwidth, power consumption, and cost for your specific workload, whether that is training a frontier model on thousands of chips or running a classifier on a mobile phone.

NVIDIA GPUs: The Industry Standard

NVIDIA controls approximately 80% of the AI accelerator market, a dominance built on three pillars: superior hardware, the CUDA software ecosystem, and deep integration with every major deep learning framework. The CUDA programming model, which lets developers write GPU-accelerated code in C++ and Python, has been the standard for GPU computing since 2007. Virtually every deep learning library, from PyTorch to TensorFlow to JAX, relies on CUDA when running on NVIDIA GPUs. This software ecosystem creates a powerful lock-in effect: even if a competitor builds faster hardware, the lack of CUDA compatibility makes adoption difficult.

Datacenter GPUs

The NVIDIA A100, based on the Ampere architecture, was the workhorse of AI training from 2020 to 2023. It features 80 GB of HBM2e memory at 2 TB/s bandwidth, 312 TFLOPS of FP16 tensor core performance, and NVLink interconnects for multi-GPU communication at 600 GB/s. The A100 remains widely deployed in cloud datacenters and represents the baseline hardware that most large models were trained on.

The H100, based on the Hopper architecture, more than doubled the A100's performance. In the SXM version, its fourth-generation tensor cores deliver about 989 TFLOPS of dense FP16 throughput, and the new FP8 data type pushes that to about 1,979 TFLOPS; the PCIe version is rated at about 756 and 1,513 TFLOPS respectively. HBM3 memory on the SXM version provides 3.35 TB/s bandwidth with 80 GB capacity. The Transformer Engine, introduced with Hopper, automatically manages mixed-precision computation during transformer training, choosing between FP8 and FP16 for each layer to maximize speed while maintaining accuracy.

NVIDIA's B200 and GB200 chips (Blackwell architecture, available 2024-2025) push further: the B200 achieves roughly 4,500 TFLOPS of dense FP8 throughput, about double that at FP4, and carries up to 192 GB of HBM3e memory. The GB200 pairs two B200 GPUs with a Grace CPU on a single module, connected by NVLink-C2C at 900 GB/s. These chips are designed for the era of trillion-parameter models, where the memory and bandwidth requirements exceed what any single previous-generation chip could provide.

Consumer GPUs

NVIDIA's consumer RTX series provides excellent price-performance for individuals and small teams. The RTX 4090 (24 GB GDDR6X, approximately $1,600) delivers 82.6 TFLOPS of standard FP32 shader throughput and roughly 165 to 330 TFLOPS of dense FP16 tensor core throughput, depending on accumulation precision. That makes it capable of fine-tuning models up to about 13 billion parameters with quantization and gradient checkpointing. The RTX 4080 (16 GB, approximately $1,000) handles models up to about 7 billion parameters. These cards lack the memory and interconnect bandwidth of datacenter GPUs but provide enough capability for research, fine-tuning, and inference at a fraction of the cost.

The RTX 5090 (32 GB GDDR7, available 2025) raises VRAM from 24 GB to 32 GB, a meaningful threshold that allows fine-tuning larger models with less reliance on quantization. For practitioners who experiment with models in the 7 to 30 billion parameter range, the extra 8 GB is significant, because it often determines whether a model fits in memory without resorting to techniques that reduce quality or speed.
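
As a rough check on these thresholds, the weights-only footprint of a model can be estimated as parameter count times bytes per parameter. The sketch below assumes decimal gigabytes and deliberately ignores activations, gradients, optimizer state, and KV cache, all of which add substantially during fine-tuning. The function name weight_memory_gb and the precision table are illustrative, not part of any library.

```python
# Weights-only memory estimate: parameters x bits per parameter.
# Assumption: 1 GB = 1e9 bytes; activations, gradients, and optimizer
# state are NOT included, so real fine-tuning needs more than this.
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in decimal gigabytes."""
    bits = BITS_PER_PARAM[precision]
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    for billions in (7, 13, 30):
        row = ", ".join(
            f"{p}: {weight_memory_gb(billions * 1e9, p):.1f} GB"
            for p in ("fp16", "int8", "int4")
        )
        print(f"{billions}B params -> {row}")
```

Under these assumptions, a 13-billion-parameter model needs about 26 GB for FP16 weights alone, which exceeds 24 GB but fits in 32 GB, while the same weights at 4 bits take about 6.5 GB.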

Google TPUs

Tensor Processing Units (TPUs) are custom AI accelerators designed by Google specifically for neural network workloads. TPUs use a systolic array architecture that is optimized for the matrix multiplications at the heart of deep learning, achieving high utilization on these specific operations. TPUs are available exclusively through Google Cloud and are used internally at Google for training their largest models, including Gemini.
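
To make the systolic array idea concrete, here is a minimal software model of one. It assumes an output-stationary grid in which processing element (i, j) accumulates one output value, with inputs skewed so that the k-th operand pair reaches that element at cycle i + j + k. This variant is chosen for simplicity and is not a description of the TPU's actual dataflow; the function name systolic_matmul is hypothetical.

```python
# Toy model of an output-stationary systolic array computing C = A x B.
# Assumption: inputs are skewed so PE (i, j) sees the pair
# (A[i][k], B[k][j]) at cycle t = i + j + k. Real hardware differs.
def systolic_matmul(a, b):
    n, m, p = len(a), len(b), len(b[0])
    acc = [[0] * p for _ in range(n)]  # one accumulator per PE
    cycles = n + m + p - 2             # last pair arrives at cycle n+m+p-3
    for t in range(cycles):
        for i in range(n):
            for j in range(p):
                k = t - i - j          # which operand pair arrives now
                if 0 <= k < m:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, cycles


if __name__ == "__main__":
    result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(result, "in", cycles, "cycles")
```

The point of the arrangement is that every element performs a multiply-accumulate on each cycle once the pipeline fills, and operands move only between neighbouring elements instead of making repeated trips to memory.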

TPU v4, deployed in large pods of up to 4,096 chips connected by a custom 3D torus interconnect, was used to train PaLM and other Google research models. Each TPU v4 chip provides 275 TFLOPS at BF16 and 32 GB of HBM memory. The 3D torus interconnect provides extremely high bisection bandwidth, making TPU pods particularly efficient for the all-reduce communication patterns used in distributed training. The next generation splits by purpose: TPU v5e is a cost-optimized part, while TPU v5p raises per-chip performance and memory to 459 TFLOPS at BF16 and 95 GB of HBM.
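
For readers unfamiliar with all-reduce, the operation leaves every worker holding the element-wise sum of all workers' gradients. The sketch below simulates the common ring variant in plain Python, with each worker's gradient split into one scalar chunk per worker. It is an illustration of the communication pattern only, not a description of how TPU pods implement collectives, and ring_all_reduce is a hypothetical name.

```python
# Simulated ring all-reduce. data[w][c] is chunk c held by worker w.
# Phase 1 (reduce-scatter): partial sums travel around the ring.
# Phase 2 (all-gather): completed chunks travel around the ring.
def ring_all_reduce(worker_chunks):
    n = len(worker_chunks)
    data = [list(chunks) for chunks in worker_chunks]

    for step in range(n - 1):
        # Snapshot outgoing values first so all sends happen "at once".
        sends = [(w, (w - step) % n) for w in range(n)]
        values = [data[w][c] for w, c in sends]
        for (w, c), value in zip(sends, values):
            data[(w + 1) % n][c] += value      # neighbour accumulates

    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n) for w in range(n)]
        values = [data[w][c] for w, c in sends]
        for (w, c), value in zip(sends, values):
            data[(w + 1) % n][c] = value       # neighbour overwrites

    return data


if __name__ == "__main__":
    grads = [[10 * w + c for c in range(4)] for w in range(4)]
    print(ring_all_reduce(grads))
```

Each worker sends and receives 2(n - 1) chunks in total, so the per-worker traffic stays roughly constant as workers are added. That is why the bandwidth of the links between neighbouring chips matters so much for this pattern.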

The primary advantage of TPUs is cost-effectiveness on Google Cloud for large-scale training. For training transformer models at scale, TPUs can deliver better price-performance than NVIDIA GPUs, particularly when using Google's JAX framework, which is optimized for TPU execution. The primary disadvantage is ecosystem lock-in: TPUs only run on Google Cloud, only support certain frameworks well (JAX and TensorFlow), and have a smaller community than CUDA GPUs. PyTorch support for TPUs exists through the PyTorch/XLA library but is less mature than native CUDA support.

Custom AI Accelerators

The AI chip landscape extends well beyond NVIDIA and Google. AMD's MI300X (192 GB HBM3, 1,307 TFLOPS at FP16) positions AMD as a serious datacenter AI competitor, with growing software support through the ROCm platform that provides a CUDA-like programming model. Intel's Gaudi 3 targets the training market with competitive performance at lower prices. Cerebras' Wafer-Scale Engine is a single chip the size of an entire silicon wafer; its third generation, the WSE-3, has 900,000 cores and 44 GB of on-chip SRAM and is designed to eliminate the communication bottlenecks of multi-chip systems.

Inference-focused accelerators prioritize throughput and efficiency over the flexibility needed for training. Groq's Language Processing Units achieve extremely low latency for transformer inference through a deterministic execution model that eliminates the scheduling overhead of GPUs. SambaNova's DataScale systems target enterprise AI workloads. AWS Inferentia and Trainium provide cost-effective AI compute within the AWS ecosystem. For their target workloads, these specialized chips are often reported to deliver 2 to 5x better price-performance than general-purpose GPUs, although the advantage depends heavily on the model and batch size.

Edge and Mobile AI Hardware

Running AI models on edge devices (smartphones, cameras, robots, vehicles, and IoT sensors) requires hardware that balances inference capability with power consumption, heat, and physical size. Apple's Neural Engine, integrated into recent iPhones and Apple silicon Macs, provides up to 38 TOPS (trillion operations per second) for on-device inference, powering features like Face ID, photo processing, and Siri. Qualcomm's Hexagon DSP and AI Engine provide similar capabilities for Android devices.

NVIDIA's Jetson platform targets robotics and embedded AI, with the Jetson Orin NX providing 100 TOPS in a module roughly the size of a credit card. This enables real-time object detection, path planning, and sensor fusion on robots, drones, and autonomous vehicles without cloud connectivity. Google's Edge TPU provides 4 TOPS in a tiny package designed for IoT applications like smart cameras and voice recognition devices.

The key technique for deploying models on edge hardware is quantization: reducing the precision of model weights from 32-bit floating point to 8-bit integers or even 4-bit integers. This reduces model size by 4 to 8 times, increases inference speed (integer operations are faster than floating-point on most hardware), and reduces power consumption. Quantization-aware training and post-training quantization tools in TensorFlow Lite, ONNX Runtime, and PyTorch Mobile make this process largely automated.
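
The core arithmetic of quantization is simple enough to show directly. The sketch below applies symmetric per-tensor post-training quantization to a list of float weights, mapping them to integers in [-127, 127] with a single scale factor. It is a minimal illustration under those assumptions rather than what any particular toolkit does; production tools add per-channel scales, calibration data, and activation quantization. The function names are hypothetical.

```python
# Symmetric per-tensor int8 quantization (illustrative sketch).
# Assumption: one scale for the whole tensor, no zero-point.
def quantize_int8(weights):
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, f"scale={scale:.4f}", f"max error={worst:.5f}")
```

Each weight now occupies one byte instead of four, which is where the 4x size reduction comes from, and the reconstruction error per weight is bounded by half of the scale.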

Choosing Hardware

For learning and small experiments, a single consumer NVIDIA GPU (RTX 4070 or better) is sufficient. For serious research and fine-tuning, an RTX 4090 or cloud GPU instances (A100 or H100 via AWS, GCP, or Lambda Labs) provide the necessary memory and speed. For training models with more than 30 billion parameters, multi-GPU setups or cloud clusters are required. For frontier model training at the 100B+ parameter scale, you need hundreds to thousands of datacenter GPUs or TPUs, which practically means either building your own cluster or renting from a major cloud provider.

The most important specification is usually memory, not compute, and memory matters in two distinct ways. The first is bandwidth: modern models are often memory-bound, meaning the GPU can compute faster than it can read data from memory. The second is capacity: a card with 80 GB of slower memory may be more useful than a card with 24 GB of faster memory, because it can handle larger models and batch sizes without the overhead of offloading, gradient checkpointing, or model parallelism. For inference, capacity determines the largest model you can serve; for training, it determines the largest model you can fine-tune without resorting to memory-saving techniques.
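
The claim that a GPU can compute faster than it can read memory can be quantified with a simple roofline estimate: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below uses the A100 figures quoted earlier (312 TFLOPS at FP16, 2 TB/s) and treats them as ideal peaks, which real kernels rarely reach. The function names are illustrative.

```python
# Roofline estimate. Units: TFLOPS for compute, TB/s for bandwidth,
# FLOP per byte for arithmetic intensity (TB/s x FLOP/byte = TFLOPS).
def ridge_point(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Intensity above which a kernel becomes compute-bound."""
    return peak_tflops / bandwidth_tb_s


def attainable_tflops(intensity: float, peak_tflops: float,
                      bandwidth_tb_s: float) -> float:
    """Upper bound on throughput for a kernel of the given intensity."""
    return min(peak_tflops, intensity * bandwidth_tb_s)


if __name__ == "__main__":
    peak, bw = 312.0, 2.0  # A100 figures from the text above
    print("ridge point:", ridge_point(peak, bw), "FLOP/byte")
    for intensity in (1, 16, 156, 500):
        print(intensity, "->", attainable_tflops(intensity, peak, bw), "TFLOPS")
```

With these numbers a kernel must perform about 156 floating-point operations per byte read before compute becomes the limit. A matrix-vector product over FP16 weights performs roughly 2 operations per 2-byte weight, an intensity near 1, so it is capped at about 2 TFLOPS no matter how high the peak compute figure is.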

Key Takeaway

NVIDIA GPUs dominate deep learning through superior hardware and the CUDA ecosystem. Google TPUs offer competitive performance for transformer workloads on Google Cloud. Custom accelerators are emerging for specific use cases. Memory capacity is often more important than raw compute throughput, and quantization enables deployment on resource-constrained edge devices.