What Are GANs? Generative Adversarial Networks Explained
The Adversarial Framework
Ian Goodfellow and his coauthors introduced GANs in 2014 with an analogy: the generator is a counterfeiter trying to produce fake currency, and the discriminator is a detective (the police, in the original paper) trying to detect the fakes. As the detective gets better at spotting fakes, the counterfeiter is forced to improve. As the counterfeiter improves, the detective must become more discerning. This competition pushes both networks to keep improving.
Formally, the generator G takes a random noise vector z (sampled from a simple distribution like a standard Gaussian) and outputs a synthetic data sample G(z). The discriminator D takes a data sample (either real or generated) and outputs a probability that the sample is real. Training alternates between two objectives: update D to correctly classify real samples as real and generated samples as fake, then update G to fool D into classifying generated samples as real.
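The two alternating updates can be written as a single minimax objective. The expression below is the value function from the original 2014 GAN paper; the symbols p_data (the real data distribution) and p_z (the noise distribution) are notation introduced here for the sketch.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

D raises V by pushing D(x) toward 1 on real samples and D(G(z)) toward 0 on generated ones. G lowers V by pushing D(G(z)) toward 1.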
The theoretical optimum is a Nash equilibrium in which the generator's samples are indistinguishable from real data and the discriminator outputs 0.5 for every input (it can do no better than guessing). In practice, training rarely reaches this equilibrium, but the adversarial dynamic can still produce high-quality generative models.
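To see where the 0.5 comes from, here is a minimal sketch on a toy discrete space. It relies on a result from the original GAN paper that is not stated above: for a fixed generator, the best possible discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)), where p_g is the distribution of generated samples. The four-outcome distributions are hypothetical values chosen only for illustration.

```python
import math

def optimal_discriminator(p_data, p_g):
    """Return D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_function(p_data, p_g, d):
    """V(D, G) = E_data[log D(x)] + E_g[log(1 - D(x))] on a discrete space."""
    return sum(
        pd * math.log(dx) + pg * math.log(1.0 - dx)
        for pd, pg, dx in zip(p_data, p_g, d)
    )

# Hypothetical toy distributions over four outcomes.
p_data = [0.1, 0.2, 0.3, 0.4]
p_g_bad = [0.4, 0.3, 0.2, 0.1]   # generator far from the data distribution
p_g_good = list(p_data)          # generator matches the data distribution

d_bad = optimal_discriminator(p_data, p_g_bad)
d_good = optimal_discriminator(p_data, p_g_good)

print("D* when G is poor:   ", d_bad)
print("D* when G is perfect:", d_good)
print("V at equilibrium:", value_function(p_data, p_g_good, d_good))
print("-log(4):         ", -math.log(4))
```

When the two distributions match, the ratio is 0.5 at every outcome and the value function equals -log 4, which is the equilibrium value derived in the original paper.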
The Generator
The generator maps a low-dimensional noise vector (typically 100 to 512 dimensions) to a high-dimensional output (an image with thousands or millions of pixel values). This mapping is learned entirely during training; the generator starts out producing unstructured noise and gradually learns to produce structured, realistic outputs.
For image generation, the generator typically uses transposed convolutions (sometimes loosely called deconvolutions), which progressively upsample from a small spatial resolution to the target resolution. A generator might start with a 4x4 feature map and upsample through 8x8, 16x16, 32x32, 64x64, and 128x128 to a 256x256 output, adding detail at each level. Batch normalization and ReLU activations are standard between layers, with a tanh activation at the output to produce pixel values in the range -1 to 1.
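The doubling at each stage follows from the output-size arithmetic of a transposed convolution. The sketch below assumes kernel size 4, stride 2, and padding 1 (a common choice, but not the only one), with dilation 1 and no output padding. The helper names are hypothetical.

```python
def transposed_conv_output_size(size_in, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution along one spatial axis.

    Assumes dilation 1 and no output padding.
    """
    return (size_in - 1) * stride - 2 * padding + kernel

def resolution_schedule(start=4, target=256):
    """Spatial sizes produced by stacking upsampling layers until target is reached."""
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(transposed_conv_output_size(sizes[-1]))
    return sizes

def tanh_to_pixel(value):
    """Map a tanh output in [-1, 1] to an 8-bit pixel value in [0, 255]."""
    return round((value + 1.0) / 2.0 * 255.0)

print(resolution_schedule())                      # [4, 8, 16, 32, 64, 128, 256]
print(tanh_to_pixel(-1.0), tanh_to_pixel(1.0))    # 0 255
```

With these settings each layer exactly doubles the resolution, so six layers take a 4x4 map to 256x256. The last helper shows the usual rescaling needed to turn tanh outputs back into displayable pixel values.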
The noise vector z acts as a seed that determines which specific image is generated. Different z vectors produce different images, and interpolating smoothly between two z vectors produces a smooth transition between the corresponding images. This latent space navigation is one of the most useful properties of GANs, enabling controlled image manipulation by moving in specific directions within the latent space.
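Latent interpolation can be sketched in a few lines. This example uses plain linear interpolation between two random latent vectors; the 100-dimensional size, the fixed seed, and the function names are assumptions for illustration, and the trained generator itself is left out.

```python
import random

def lerp(z_start, z_end, t):
    """Linear interpolation between two latent vectors, with t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]

def interpolation_path(z_start, z_end, steps):
    """Return `steps` evenly spaced latent vectors from z_start to z_end inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
dim = 100
z_a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
z_b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

path = interpolation_path(z_a, z_b, steps=8)
print(len(path), "latent vectors, each of dimension", len(path[0]))
```

Feeding each vector in path to a trained generator, in order, would give the sequence of images that morphs from the first image to the second.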
The Discriminator
The discriminator is a standard classification network (typically a CNN for image GANs) that takes an image as input and outputs a single probability. It is trained on a mix of real images from the training dataset and fake images from the generator, with the objective of correctly labeling each.
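The labeling objective is usually expressed as binary cross-entropy. The sketch below computes that loss from lists of discriminator outputs; the small eps term is an assumption added to avoid log(0), and the probability values are made up for illustration.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy with target 1 for real samples and 0 for fakes.

    d_real: discriminator outputs D(x) on real samples.
    d_fake: discriminator outputs D(G(z)) on generated samples.
    """
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well gets a low loss.
confident = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# A discriminator that outputs 0.5 everywhere gets log(4), about 1.386.
guessing = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print("confident:", confident)
print("guessing: ", guessing)
```

Minimizing this loss is the same as maximizing the discriminator's side of the adversarial objective.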
The discriminator serves two roles. First, it provides the training signal for the generator. The generator has no direct access to real images; the only feedback it receives is how successfully it fooled the discriminator. Second, a well-trained discriminator has learned a rich representation of what makes images realistic, capturing texture consistency, structural coherence, color distributions, and other properties that distinguish real from fake.
After training, the discriminator is typically discarded. Only the generator is used in deployment. However, the discriminator's learned features are sometimes repurposed for other tasks like anomaly detection or semi-supervised classification.
Training Dynamics and Challenges
GAN training is notoriously unstable because it is a minimax game rather than a simple minimization. Instead of following a gradient downhill, the two networks are pulling in opposing directions, and the training process must balance their competing objectives.
Mode collapse is the most common failure. The generator discovers a small set of outputs that consistently fool the discriminator and produces only those, ignoring the diversity of the real data distribution. A face generator experiencing mode collapse might produce only one specific face or a handful of very similar faces. The generator has found a local optimum that satisfies its objective (fool the discriminator) without actually learning the full data distribution.
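On toy data where the modes are known, collapse is easy to detect by counting how many modes the samples reach. The following is only an illustrative diagnostic on a hypothetical one-dimensional, three-mode dataset; real image data has no labeled mode centers, so this is not a practical metric for image GANs.

```python
def covered_modes(samples, mode_centers, radius):
    """Return the indices of modes that have at least one sample within `radius`."""
    covered = set()
    for s in samples:
        # Find the mode closest to this sample.
        nearest = min(range(len(mode_centers)), key=lambda i: abs(s - mode_centers[i]))
        if abs(s - mode_centers[nearest]) <= radius:
            covered.add(nearest)
    return covered

mode_centers = [-4.0, 0.0, 4.0]                    # hypothetical data modes
healthy = [-4.1, -3.9, 0.2, -0.1, 3.8, 4.3]        # samples spread over all modes
collapsed = [3.9, 4.0, 4.1, 4.05, 3.95, 4.02]      # samples stuck on one mode

print(covered_modes(healthy, mode_centers, radius=0.5))     # {0, 1, 2}
print(covered_modes(collapsed, mode_centers, radius=0.5))   # {2}
```

Every collapsed sample is individually plausible, which is why a discriminator that judges samples one at a time may not penalize the lack of diversity.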
Training instability manifests as oscillating loss curves, where the generator and discriminator trade dominance without converging. If the discriminator becomes too strong too quickly, its gradients become uninformative for the generator (the signal is just "everything you make is fake" with no indication of how to improve). If the generator becomes too strong, the discriminator cannot provide useful training signal because it cannot distinguish real from fake.
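The "too strong discriminator" problem can be made concrete by looking at the gradient the generator receives. The sketch below assumes the discriminator ends in a sigmoid applied to a logit a, and compares two generator losses: the original minimax form log(1 - D(G(z))) and the non-saturating alternative -log D(G(z)) that the original GAN paper suggests for this reason. The article above does not discuss the alternative; it is included here only for contrast.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(logit):
    """d/da of log(1 - sigmoid(a)), the minimax generator loss."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/da of -log(sigmoid(a)), the non-saturating generator loss."""
    return sigmoid(logit) - 1.0

# More negative logits mean the discriminator is more sure the sample is fake.
for logit in (0.0, -2.0, -5.0, -10.0):
    print(
        f"logit={logit:6.1f}  D(G(z))={sigmoid(logit):.5f}  "
        f"minimax grad={saturating_grad(logit):.5f}  "
        f"non-saturating grad={non_saturating_grad(logit):.5f}"
    )
```

As the discriminator grows confident that a sample is fake, the minimax gradient shrinks toward zero, so the generator learns almost nothing from exactly the samples it most needs to fix. The non-saturating form keeps a gradient close to -1 in the same situation.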
Evaluation difficulty is a fundamental challenge. Unlike classifiers, where accuracy is a clear metric, measuring the quality of generated images is subjective and multi-dimensional. The Fréchet Inception Distance (FID), which compares the distribution of features in generated images to that of real images using a pre-trained Inception network, has become the most widely used metric. A lower FID indicates that the generated images are closer to the real ones in both realism and diversity. But FID does not capture every aspect of quality, and improving FID does not always correspond to perceptually better images.
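The distance behind FID is the Fréchet distance between two Gaussians fitted to the feature vectors. The sketch below is a deliberate simplification: it treats the feature dimensions as independent (diagonal covariance), whereas the real metric uses full covariance matrices and a matrix square root over Inception features. The two-dimensional feature lists are made-up stand-ins for those features.

```python
import math

def mean_and_variance(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in features) / n for d in range(dim)
    ]
    return means, variances

def frechet_distance_diagonal(real_features, fake_features):
    """Fréchet distance between two Gaussians, assuming diagonal covariances."""
    mu_r, var_r = mean_and_variance(real_features)
    mu_g, var_g = mean_and_variance(fake_features)
    # Squared distance between the means.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # Diagonal case of Tr(S_r + S_g - 2 (S_r S_g)^(1/2)).
    cov_term = sum(
        vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g)
    )
    return mean_term + cov_term

real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
shifted = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]   # same spread, mean shifted by 1

print(frechet_distance_diagonal(real, real))
print(frechet_distance_diagonal(real, shifted))
```

The mean term penalizes generated features that are centered in the wrong place, and the covariance term penalizes a mismatch in spread, which is how the score responds to both realism and diversity.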
Major GAN Variants
DCGAN (Deep Convolutional GAN, 2015) established the architectural conventions that most subsequent GANs follow: strided convolutions in place of pooling layers, batch normalization in both networks, ReLU in the generator (with tanh at its output) and LeakyReLU in the discriminator, and the removal of fully connected hidden layers. These guidelines made GAN training considerably more reliable, though not failure-proof.
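Wasserstein GAN (WGAN, 2017) replaced the standard GAN loss with an approximation of the Wasserstein (earth mover's) distance between the real and generated distributions, which provides smoother gradients and more meaningful loss curves. The discriminator becomes a "critic" that outputs an unbounded score rather than a probability. WGAN training is typically more stable than standard GAN training, and the loss correlates with sample quality, meaning you can actually monitor training progress from the loss curve.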
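A minimal sketch of the WGAN losses follows. In WGAN the discriminator is usually called a critic because it outputs an unbounded score instead of a probability. The weight clipping step and the 0.01 threshold come from the original WGAN paper, not from the text above, and the score values are made up for illustration.

```python
def critic_loss(critic_real, critic_fake):
    """Loss minimized by the critic: mean score on fakes minus mean score on reals.

    Its negation is the critic's current estimate of the Wasserstein distance.
    """
    return sum(critic_fake) / len(critic_fake) - sum(critic_real) / len(critic_real)

def generator_loss(critic_fake):
    """Loss minimized by the generator: raise the critic's score on fakes."""
    return -sum(critic_fake) / len(critic_fake)

def clip_weights(weights, c=0.01):
    """Clamp critic weights to [-c, c], the original paper's Lipschitz constraint."""
    return [max(-c, min(c, w)) for w in weights]

print(critic_loss(critic_real=[2.0, 3.0], critic_fake=[-1.0, 0.0]))   # -3.0
print(generator_loss(critic_fake=[-1.0, 0.0]))                        # 0.5
print(clip_weights([0.5, -0.5, 0.005]))                               # [0.01, -0.01, 0.005]
```

Because there is no logarithm or sigmoid in these losses, the critic's gradient does not flatten out when it separates real from fake easily, which is one intuition for the smoother training signal.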
Progressive GAN (2017) trains by starting at low resolution (4x4) and progressively adding higher-resolution layers during training. This curriculum approach stabilizes training for high-resolution outputs and was the first architecture to generate photorealistic 1024x1024 face images.
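The growth schedule can be sketched as follows. The fade-in blend, where a new layer's output is mixed with the upsampled output of the previous stage while a weight alpha ramps from 0 to 1, is taken from the Progressive GAN paper rather than from the description above. The function names and pixel values are hypothetical.

```python
def progressive_resolutions(start=4, final=1024):
    """Resolutions visited during training, doubling at each stage."""
    resolutions = [start]
    while resolutions[-1] < final:
        resolutions.append(resolutions[-1] * 2)
    return resolutions

def fade_in(upsampled_old, new_layer_output, alpha):
    """Blend the previous stage's upsampled output with the new layer's output.

    alpha ramps from 0 (only the old path) to 1 (only the new layer).
    """
    return [
        (1.0 - alpha) * old + alpha * new
        for old, new in zip(upsampled_old, new_layer_output)
    ]

print(progressive_resolutions())
# [4, 8, 16, 32, 64, 128, 256, 512, 1024]
print(fade_in([1.0, 1.0], [3.0, 5.0], alpha=0.5))
# [2.0, 3.0]
```

Starting from 4x4, nine stages are needed to reach 1024x1024. The gradual blend keeps a freshly added, untrained layer from disrupting what the lower-resolution layers have already learned.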
StyleGAN (2018) and StyleGAN2 (2020) introduced a style-based generator architecture that gives fine-grained control over generated images. Coarse styles (pose, face shape) are controlled by the early, low-resolution layers, while fine styles (hair texture, skin detail) are controlled by the later, high-resolution layers. StyleGAN2 produced some of the most photorealistic face images generated up to that time, to the point where generated faces were often difficult to tell apart from photographs.
Conditional GANs (cGAN) add a conditioning signal (a class label, text description, or input image) to both the generator and discriminator. This allows targeted generation: "generate a face with glasses and brown hair" or "convert this sketch to a photorealistic image." Pix2pix (paired image-to-image translation) and CycleGAN (unpaired image-to-image translation) are prominent image-conditioned variants.
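For a class label, one simple way to supply the conditioning signal is to append a one-hot encoding of the label to the noise vector. The sketch below shows only that input construction; the 100-dimensional noise, the ten classes, and the helper names are assumptions for illustration, and other conditioning schemes (such as learned label embeddings) are also common.

```python
import random

def one_hot(label, num_classes):
    """One-hot encoding of an integer class label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def conditional_generator_input(z, label, num_classes):
    """Concatenate the noise vector with a one-hot label to form the generator input."""
    return list(z) + one_hot(label, num_classes)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
z = [rng.gauss(0.0, 1.0) for _ in range(100)]

g_input = conditional_generator_input(z, label=3, num_classes=10)
print(len(g_input))       # 110
print(g_input[-10:])      # one-hot encoding of class 3
```

The discriminator receives the same label alongside the image, so it can reject samples that look realistic but do not match the requested condition.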
Applications
Image synthesis. Generating photorealistic faces, scenes, objects, and artwork. This Person Does Not Exist (thispersondoesnotexist.com) demonstrated StyleGAN's capabilities to a broad audience.
Image-to-image translation. Converting satellite images to maps, sketches to photographs, day scenes to night scenes, and low-resolution images to high-resolution. These applications have practical value in design, entertainment, and scientific imaging.
Data augmentation. Generating synthetic training examples for domains where real data is scarce or expensive. Medical imaging, where labeled data is limited by privacy and annotation costs, has benefited from GAN-generated synthetic scans.
Super-resolution. Enhancing image resolution beyond what the original capture provides. ESRGAN and similar models can plausibly upscale images by 4x or more, adding realistic detail that was not in the original.
GANs vs. Diffusion Models
Since 2022, diffusion models (DALL-E 2, Stable Diffusion, Midjourney) have surpassed GANs as the preferred approach for image generation. Diffusion models generally produce higher-quality images, exhibit greater diversity (less mode collapse), and are easier to train (no adversarial instability). GANs retain advantages in generation speed (a single forward pass vs. the tens to hundreds of denoising steps a diffusion sampler typically needs) and in specific applications like real-time video synthesis.
The GAN framework remains influential conceptually, and adversarial training components appear in many modern systems. Understanding GANs is valuable both for the applications where they remain competitive and for the foundational ideas they contributed to generative AI.
GANs train two competing networks, a generator that creates synthetic data and a discriminator that evaluates authenticity, using adversarial dynamics to produce increasingly realistic outputs. StyleGAN demonstrated photorealistic face generation, and conditional GANs enabled image-to-image translation. While diffusion models have surpassed GANs for most generation tasks since 2022, GANs remain relevant for speed-critical applications and contributed foundational ideas to modern generative AI.