What Is a CNN? Convolutional Neural Networks Explained

Updated May 2026

A convolutional neural network (CNN) is a type of neural network designed specifically for processing spatial data like images. Instead of connecting every pixel to every neuron (which would require hundreds of millions to billions of parameters for ordinary image sizes), CNNs use small learnable filters that slide across the image, detecting local patterns like edges, textures, and shapes at every position. This weight-sharing approach makes CNNs dramatically more efficient than fully connected networks for visual tasks and has made them the standard architecture for image classification, object detection, and medical imaging since 2012.

Why Images Need Special Architecture

A 224x224 color image has 150,528 pixel values (224 * 224 * 3 channels). If the first layer were fully connected with just 1,000 neurons, it would need about 150 million weights in that layer alone. For a 1,024x1,024 image, the number exceeds 3 billion. That is wildly impractical, but the real problem is not just the parameter count. It is that a fully connected layer ignores the spatial structure of the image entirely. It treats every pixel as an independent input, with no concept that nearby pixels are related or that the same patterns (edges, textures) can appear at different positions, so a pattern learned in the top-left corner would have to be learned again for the bottom-right.
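
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name dense_weights is made up for illustration, and biases are left out, as they are in the figures quoted above.

```python
def dense_weights(height, width, channels, neurons):
    """Weights in a fully connected layer fed by a flattened image (biases excluded)."""
    inputs = height * width * channels
    return inputs * neurons

small = dense_weights(224, 224, 3, 1000)     # 224x224 color image
large = dense_weights(1024, 1024, 3, 1000)   # 1,024x1,024 color image

print(f"{small:,}")   # 150,528,000
print(f"{large:,}")   # 3,145,728,000
```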

CNNs exploit two properties of images. First, locality: nearby pixels are more related to each other than distant pixels. A neuron does not need to see the entire image to detect an edge; it only needs to see a small local region. Second, translation invariance: a cat ear looks the same whether it appears in the top-left or bottom-right of the image, so the same feature detector should be reusable across positions. (Strictly speaking, the convolution itself is translation-equivariant: when a pattern moves in the image, its response moves with it. Invariance to position is added later by pooling and by the final layers.)
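
The reuse of one detector across positions can be illustrated in one dimension. The sketch below is an assumption-laden toy, not part of any library: correlate1d and the two-weight kernel are hypothetical. It slides a single detector over a signal and over a copy of that signal shifted three positions to the right. The same weights find the pattern in both places, and the response simply moves with it.

```python
def correlate1d(signal, kernel):
    """Slide kernel over signal (no padding); return the dot product at each position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [-1, 1]                      # hypothetical detector: fires on a low-to-high step
signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 0, 1, 1, 0]    # same pattern, moved 3 positions to the right

resp = correlate1d(signal, kernel)
resp_shifted = correlate1d(shifted, kernel)

print(resp)          # [0, 1, 0, -1, 0, 0, 0]
print(resp_shifted)  # [0, 0, 0, 0, 1, 0, -1]
```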

Convolution: The Core Operation

A convolutional layer applies a set of small filters (also called kernels) to the input. Each filter is a small array of weights, typically 3x3 or 5x5 in width and height and spanning all input channels. With a stride of 1, the filter slides across the image one position at a time, and at each position, it computes a dot product between its weights and the corresponding patch of the image. The result is a single number that represents how strongly the filter's pattern is present at that location.
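
The sliding dot product can be sketched in plain Python for a single-channel image. The function name conv2d and the example values are illustrative assumptions. Like deep learning frameworks, the sketch does not flip the kernel, so in strict signal-processing terms it computes a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide kernel over a single-channel image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Dot product between the kernel and the patch under it.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 2, 3, 0],
    [4, 5, 6, 0],
    [7, 8, 9, 0],
    [0, 0, 0, 0],
]
# Hypothetical filter: adds up the main diagonal of each 3x3 patch.
kernel = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

result = conv2d(image, kernel)
print(result)  # [[15, 8], [12, 14]]
```

A 4x4 input and a 3x3 filter leave only 2x2 valid positions, which is why the output is smaller than the input when no padding is used.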

Sliding the filter across every position produces a feature map, a 2D array showing where in the image the filter's pattern was detected. A horizontal edge filter produces a feature map that is bright where horizontal edges exist and dark everywhere else. A convolutional layer typically has 32 to 512 different filters, each detecting a different pattern, producing that many feature maps.
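
As a small illustration, the sketch below applies two hand-written filters to a toy image that is dark on top and bright on the bottom. The filter values and the helper feature_map are assumptions chosen for the example; in a trained CNN the weights are learned. The horizontal edge filter responds only near the boundary, while the vertical edge filter finds nothing.

```python
def feature_map(image, kernel):
    """Response of one filter at every position (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]

# Toy image, 6 rows by 5 columns: dark (0) top half, bright (1) bottom half.
image = [[0] * 5 for _ in range(3)] + [[1] * 5 for _ in range(3)]

filters = {
    "horizontal_edge": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
    "vertical_edge": [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
}

# One feature map per filter.
maps = {name: feature_map(image, k) for name, k in filters.items()}

for row in maps["horizontal_edge"]:
    print(row)
# [0, 0, 0]
# [3, 3, 3]
# [3, 3, 3]
# [0, 0, 0]
```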

The key efficiency comes from weight sharing. The same 3x3 filter (9 weights per channel) is applied at every position. A convolutional layer with 64 filters of size 3x3 on a 3-channel input has only 64 * 3 * 3 * 3 + 64 = 1,792 parameters, regardless of whether the image is 28x28 or 1024x1024. A fully connected layer processing the same images would have millions to billions of parameters.
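
The parameter count follows directly from the filter shape. The helper below is a hypothetical name used only to reproduce the figure above. Note that image height and width do not appear anywhere in the formula.

```python
def conv_params(in_channels, out_filters, kernel_h, kernel_w, bias=True):
    """Learnable parameters in a 2D convolutional layer."""
    weights = out_filters * in_channels * kernel_h * kernel_w
    biases = out_filters if bias else 0   # one bias per filter
    return weights + biases

print(conv_params(in_channels=3, out_filters=64, kernel_h=3, kernel_w=3))  # 1792
```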

Building the Feature Hierarchy

Stacking multiple convolutional layers creates a hierarchy of increasingly complex features, which is the real power of CNNs.

The first convolutional layer learns edge detectors. These are the most basic visual features: vertical edges, horizontal edges, diagonal edges at various angles, and color boundaries. Visualization studies show that CNNs trained on natural images tend to learn very similar first-layer filters regardless of the exact architecture or task, because edges are basic building blocks of visual information.

The second and third layers combine edges into textures and simple shapes. A corner detector emerges from combining a horizontal edge detector and a vertical edge detector. A curve detector combines edge detectors at gradually changing angles. Repeating patterns of edges create texture detectors for fur, water, bricks, fabric, or tree bark.

Middle layers (roughly layers 4 through 8 in a network of moderate depth) combine textures into object parts. A fur texture plus a pointed shape might activate a "cat ear" neuron. A circular arrangement of specific color gradients might activate an "eye" detector. At this level, features are clearly related to the objects the network is trained to recognize.

Deep layers combine object parts into complete object representations. The "cat ear" detector plus the "eye" detector plus a "whisker" detector arranged in the right spatial configuration becomes a "cat face" detector. The final layers produce representations that directly support the classification decision.

Pooling: Reducing Spatial Dimensions

Pooling layers reduce the spatial dimensions of feature maps, making the network more computationally efficient and providing a degree of translation invariance.

Max pooling is the most common type. It divides each feature map into non-overlapping regions (typically 2x2) and keeps only the maximum value in each region. This halves the width and height of the feature map, reducing the number of values by a factor of 4. The maximum value is kept because it represents the strongest detection of the feature in that region; the precise position within the region is discarded.
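
A minimal sketch of 2x2 max pooling on one feature map follows. The function name and the sample values are assumptions, and the sketch expects even height and width.

```python
def max_pool_2x2(fmap):
    """Keep the maximum of each non-overlapping 2x2 region (even height and width assumed)."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]), 2)
        ]
        for r in range(0, len(fmap), 2)
    ]

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4, 2], [2, 8]]
```

The 16 input values become 4, and moving the 4 or the 8 to another cell inside its own 2x2 region would leave the output unchanged.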

Average pooling takes the mean value in each region instead of the maximum. Global average pooling, applied to the entire feature map at the end of the network, produces a single value per feature map, replacing fully connected layers with a more parameter-efficient alternative. Many modern architectures use global average pooling rather than flattening the feature maps into a dense layer.
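
Global average pooling can be sketched in a few lines. The names and values here are illustrative assumptions. Three feature maps go in and a vector of three numbers comes out, whatever the spatial size of the maps.

```python
def global_average_pool(feature_maps):
    """Collapse each 2D feature map to its mean, giving one value per map."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

feature_maps = [
    [[1, 3], [5, 7]],   # mean 4.0
    [[0, 0], [0, 2]],   # mean 0.5
    [[2, 2], [2, 2]],   # mean 2.0
]
vector = global_average_pool(feature_maps)
print(vector)  # [4.0, 0.5, 2.0]
```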

Some modern architectures drop separate pooling layers and instead reduce spatial dimensions with strided convolutions, that is, convolutional layers that use a stride greater than 1. This gives the network learnable downsampling rather than a fixed operation.
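
The effect of stride on output size follows the standard formula floor((n + 2p - k) / s) + 1 for input size n, kernel size k, padding p, and stride s. The helper below is a hypothetical name for that formula, and the 224-pixel input is just an example.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output width (or height) of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

# 3x3 convolution with padding 1: stride 1 keeps the size, stride 2 halves it.
print(conv_output_size(224, kernel=3, stride=1, padding=1))  # 224
print(conv_output_size(224, kernel=3, stride=2, padding=1))  # 112

# 2x2 pooling with stride 2 gives the same halving with no learned weights.
print(conv_output_size(224, kernel=2, stride=2))             # 112
```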

Landmark CNN Architectures

LeNet-5 (1998) was one of the earliest successful CNNs, developed by Yann LeCun and colleagues for handwritten digit recognition. It had just seven layers (counting its subsampling layers) and about 60,000 parameters. LeNet proved that CNNs could learn useful features from images, but computational limitations prevented scaling to larger, more complex tasks.

AlexNet (2012) triggered the deep learning revolution by winning the ImageNet competition with a top-5 error more than 10 percentage points lower than the runner-up's. It had 8 learned layers and about 60 million parameters, used ReLU activations (instead of saturating functions such as sigmoid or tanh), and was trained on GPUs. AlexNet demonstrated that deep CNNs, given enough data and compute, could dramatically outperform hand-engineered features for image recognition.

VGGNet (2014) showed that deeper networks with smaller filters (all 3x3) outperformed shallower networks with larger filters. VGG-16 (16 layers) and VGG-19 (19 layers) were simple, uniform architectures that made a strong case for depth as a key variable. Their size made them expensive to train and run: VGG-16 alone has roughly 138 million parameters.

ResNet (2015) introduced residual (skip) connections that enabled training networks with over 100 layers. ResNet-152 achieved lower ImageNet error than earlier architectures while being far deeper than the networks that preceded it. Skip connections addressed the degradation problem, in which adding layers to a plain network actually increased training error, and they remain a standard component of nearly every deep network.
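
The idea behind a skip connection can be sketched on plain vectors. This is a toy under stated assumptions: residual_block and zero_layers are hypothetical names, and zero_layers stands in for convolutional layers that have learned nothing. The block outputs relu(F(x) + x), so when F contributes nothing, a non-negative input passes through unchanged.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def residual_block(x, layer_fn):
    """Return relu(F(x) + x): the layers learn a correction F that is added to the input."""
    fx = layer_fn(x)
    return relu([f + xi for f, xi in zip(fx, x)])

def zero_layers(x):
    """Hypothetical stand-in for the block's layers: contributes nothing."""
    return [0.0 for _ in x]

x = [0.5, 1.0, 2.0]
print(residual_block(x, zero_layers))  # [0.5, 1.0, 2.0]
```

Because an extra block can fall back to passing its input through, stacking more blocks does not have to make the network worse, which is the intuition usually given for why residual networks train well at great depth.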

EfficientNet (2019) systematically scaled width, depth, and resolution together using a single compound scaling coefficient. EfficientNet-B7 reached state-of-the-art ImageNet accuracy at the time while being 8.4 times smaller than the best previous convolutional network (GPipe), demonstrating that balanced scaling is more efficient than simply making networks deeper or wider.
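
Compound scaling can be sketched as three multipliers raised to a shared exponent. The base values below (1.2 for depth, 1.1 for width, 1.15 for resolution) are the coefficients as recalled from the EfficientNet paper and should be treated as assumptions, as should the function names.

```python
# Base coefficients as recalled from the EfficientNet paper (assumed, not from this article).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution

def compound_scale(phi):
    """Multipliers for depth, width, and resolution at compound coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }

def flops_factor(phi):
    """Approximate compute growth: linear in depth, quadratic in width and resolution."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi

for phi in range(4):
    scale = compound_scale(phi)
    print(phi, {k: round(v, 2) for k, v in scale.items()}, round(flops_factor(phi), 2))
```

With these values, each unit increase in the coefficient roughly doubles the compute while growing all three dimensions together.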

CNNs Beyond Image Classification

While image classification was the first CNN application to gain widespread attention, the architecture has been adapted to many other visual tasks.

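Object detection (YOLO, Faster R-CNN, SSD) uses CNNs to not only classify what is in an image but locate it with bounding boxes. Single-stage detectors such as YOLO and SSD process an image once and predict object categories and locations simultaneously, fast enough for real-time video processing. Two-stage detectors such as Faster R-CNN first propose candidate regions and then classify and refine them.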

Semantic segmentation (U-Net, DeepLab) classifies every pixel in an image, producing a detailed map of where each object or region is. This is essential for medical imaging, autonomous driving, and satellite image analysis.

Image generation also uses CNNs, as components of generative architectures. The decoder portion of a convolutional variational autoencoder, for example, commonly uses transposed convolutions (learned upsampling) to generate images from compressed representations.
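
A transposed convolution can be sketched in one dimension: each input value stamps a scaled copy of the kernel into a longer output, and overlapping stamps add up. The function name and the all-ones kernel are assumptions for illustration. With that kernel and a stride of 2 the result is plain value-repeating upsampling, whereas in a real decoder the kernel weights are learned.

```python
def transposed_conv1d(values, kernel, stride=2):
    """Each input value adds a scaled copy of the kernel to a longer output."""
    out_len = (len(values) - 1) * stride + len(kernel)
    output = [0.0] * out_len
    for i, v in enumerate(values):
        for j, w in enumerate(kernel):
            output[i * stride + j] += v * w
    return output

upsampled = transposed_conv1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0], stride=2)
print(upsampled)  # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```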

CNNs have also been applied successfully beyond images: to audio spectrograms (where time and frequency form a 2D structure), to volumetric data such as voxelized 3D point clouds (using 3D convolutions), and even to certain types of graph data.

Key Takeaway

Convolutional neural networks process images efficiently by sliding small learned filters across the input, detecting local patterns that build from edges in early layers to complex objects in deep layers. Weight sharing makes CNNs dramatically more parameter-efficient than fully connected networks. The CNN architecture, refined through milestones from LeNet to EfficientNet, has been the standard for computer vision since AlexNet's 2012 breakthrough and remains dominant for spatial data processing.