Convolutional Networks Explained: How CNNs Process Images
How Convolution Works
The core operation in a CNN is convolution: a small matrix called a filter or kernel slides across the input image, computing a weighted sum at each position. A typical filter is 3x3 or 5x5 pixels. As it moves across the image (the "stride" determines how many pixels it shifts), it produces a new 2D array called a feature map. Each value in the feature map indicates how strongly the pattern encoded in that filter matches the corresponding region of the input image.
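The sliding weighted sum described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how any framework implements it: the function name `conv2d_single`, the 5x5 test image, and the averaging filter are all assumptions made for the example. Like most deep learning code, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide `kernel` over a 2D `image` (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]            # region under the filter
            feature_map[i, j] = np.sum(patch * kernel)   # weighted sum
    return feature_map

# Hypothetical inputs: a 5x5 ramp image and a 3x3 averaging filter.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0
fmap = conv2d_single(image, kernel)
print(fmap.shape)  # (3, 3)
```

With stride 2 the same call visits only every other position, so the 5x5 input yields a 2x2 feature map.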
A single convolutional layer applies many filters simultaneously. The first convolutional layer of a typical network might have 64 filters, producing 64 separate feature maps. Each filter learns to detect a different pattern. One filter might become a vertical edge detector, producing high values wherever there is a sharp transition from dark to light pixels moving horizontally. Another might detect diagonal lines, color gradients, or specific textures. The network learns which patterns are useful during training by adjusting the numerical values in each filter.
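To make the edge-detector idea concrete, the sketch below applies two hand-written filters at once to a toy image with a dark left half and a light right half. In a real network the filter values are learned; the values here are hard-coded assumptions chosen for illustration, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical 6x6 grayscale image: dark left half (0), light right half (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Two 3x3 filters stacked into one (num_filters, 3, 3) array.
filters = np.array([
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],    # responds to dark-to-light, left to right
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],    # responds to dark-to-light, top to bottom
], dtype=float)

# Every 3x3 window of the image: shape (4, 4, 3, 3).
windows = sliding_window_view(image, (3, 3))

# One feature map per filter: shape (2, 4, 4).
feature_maps = np.einsum("ijkl,fkl->fij", windows, filters)

print(feature_maps[0][0])  # [0. 3. 3. 0.] -> strong response at the vertical edge
print(feature_maps[1][0])  # [0. 0. 0. 0.] -> no horizontal edge in this image
```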
The mathematical operation for a 3x3 filter on a single-channel (grayscale) image is straightforward. At each position, the filter's 9 values are multiplied element-wise with the 9 corresponding pixel values, and the products are summed. For a color image with three channels (red, green, blue), the filter has 3x3x3 = 27 values, and the sum includes all three channels. This produces a single number at each position, so a 3-channel input with 64 filters produces 64 single-channel feature maps.
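The channel arithmetic can be checked with a short sketch. The 8x8 input size, the random values, and the (filters, channels, height, width) layout of the filter array are assumptions for this example; frameworks differ in how they order these axes.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
x = rng.random((8, 8, 3))                     # hypothetical 8x8 RGB image (H, W, C)
filters = rng.standard_normal((64, 3, 3, 3))  # 64 filters, each (C, 3, 3) = 27 values

# All 3x3 spatial windows, keeping channels: shape (6, 6, 3, 3, 3) = (i, j, C, kh, kw).
windows = sliding_window_view(x, (3, 3), axis=(0, 1))

# Sum over channels and both kernel axes: one number per filter per position.
feature_maps = np.einsum("ijckl,fckl->fij", windows, filters)

print(filters[0].size)     # 27
print(feature_maps.shape)  # (64, 6, 6)
```

Each of the 64 output maps is single-channel because the sum runs over all three input channels.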
Padding and stride control the output dimensions. "Same" padding adds zeros around the border of the input so that, at stride 1, the output feature map has the same spatial dimensions as the input. "Valid" padding (no padding) means the output is smaller than the input because the filter cannot extend beyond the edges: a 3x3 filter trims one pixel from each side. Stride determines how far the filter moves between positions: stride 1 shifts the filter one pixel at a time and visits every position, while stride 2 skips every other position and roughly halves each spatial dimension.
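The relationship between input size, kernel size, padding, and stride follows the standard formula out = floor((n + 2p - k) / s) + 1. The helper names below are hypothetical, and the 224-pixel input is just an example size.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one spatial axis for an input of length n."""
    return (n + 2 * padding - kernel) // stride + 1

def same_padding(kernel):
    """Zeros per side that keep the size unchanged at stride 1 (odd kernels only)."""
    return (kernel - 1) // 2

valid = conv_output_size(224, kernel=3)                             # no padding
same = conv_output_size(224, kernel=3, padding=same_padding(3))     # "same" padding
strided = conv_output_size(224, kernel=3, stride=2, padding=1)      # stride 2

print(valid, same, strided)  # 222 224 112
```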
Pooling and Downsampling
Pooling layers reduce the spatial dimensions of feature maps, making the network more computationally efficient and more robust to small shifts in the input. The most common type is max pooling, which divides the feature map into non-overlapping regions (typically 2x2) and keeps only the maximum value in each region. This halves both the width and height, reducing the number of values by 75%.
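A minimal sketch of 2x2 max pooling with non-overlapping regions, assuming the feature map has even height and width. The function name and the 4x4 example values are made up for illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """Keep the maximum of each non-overlapping 2x2 block of a 2D array."""
    h, w = x.shape
    # Split each axis into (block index, position inside block), then reduce.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 4],
])
pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4 5]
#  [3 7]]
print(pooled.size / feature_map.size)  # 0.25, i.e. 75% fewer values
```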
Max pooling serves two purposes. First, it reduces computational cost: every subsequent layer operates on a smaller representation. A network that starts with a 224x224 input and pools three times reduces to 28x28, a 64x reduction in the number of spatial positions. Second, it introduces a degree of translation invariance. If a feature is detected at position (10, 10) in one image and at position (11, 11) in another, max pooling ensures that both detections produce the same pooled output, as long as they fall within the same pooling region.
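Both points can be checked numerically. The sketch below assumes 0-based indices, under which positions (10, 10) and (11, 11) share one 2x2 pooling region while (12, 12) falls in the next one. The 16x16 map size and helper names are hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """Keep the maximum of each non-overlapping 2x2 block of a 2D array."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def single_detection(row, col, size=16):
    """Feature map that is zero everywhere except one detection."""
    fmap = np.zeros((size, size))
    fmap[row, col] = 1.0
    return fmap

a = max_pool_2x2(single_detection(10, 10))
b = max_pool_2x2(single_detection(11, 11))
c = max_pool_2x2(single_detection(12, 12))

same_region = np.array_equal(a, b)           # shift stays inside one region
crossed_boundary = not np.array_equal(b, c)  # shift crosses into the next region
print(same_region, crossed_boundary)  # True True

# Cost side: three rounds of 2x2 pooling on a 224x224 input.
side = 224 // 2 ** 3
reduction = (224 * 224) // (side * side)
print(side, reduction)  # 28 64
```

The second comparison shows the limit of the invariance: a one-pixel shift that crosses a region boundary does change the pooled output.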
Average pooling computes the mean instead of the maximum. It is less common in intermediate layers but widely used as the final pooling operation before the classification head. Global average pooling takes the average of each entire feature map, producing one number per channel. This removes the need for large fully connected layers at the end of the network (a single small linear layer usually remains to produce the class scores), reducing parameter count and overfitting risk.
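The parameter saving is easy to see with some arithmetic. The 512x7x7 feature volume and the 1000-class output below are assumed sizes chosen for illustration, not figures from any specific network discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((512, 7, 7))   # hypothetical final feature volume (C, H, W)

# Global average pooling: one mean per channel.
pooled = features.mean(axis=(1, 2))
print(pooled.shape)  # (512,)

# Weights needed by a single linear layer mapping to 1000 classes (biases ignored).
num_classes = 1000
flatten_fc_weights = features.size * num_classes   # flatten, then classify
gap_fc_weights = pooled.size * num_classes         # pool, then classify
print(flatten_fc_weights, gap_fc_weights)  # 25088000 512000
```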
Architecture: Putting It All Together
A typical CNN follows a repeating pattern: convolution, activation (usually ReLU), convolution, activation, pooling, repeated several times with increasing numbers of filters. Early layers have fewer filters (32 or 64) because they detect simple patterns that do not require many detectors. Deeper layers have more filters (256, 512, or even 2048) because the number of possible complex patterns is much larger. The spatial dimensions shrink as you go deeper (through pooling or strided convolutions), while the channel dimension grows.
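The shrinking-space, growing-channels pattern can be traced with simple bookkeeping. This sketch assumes each stage uses "same"-padded convolutions (which keep height and width) followed by one 2x2 pooling step; the filter counts are example values.

```python
def trace_shapes(height, width, channels, filters_per_stage):
    """Return the (H, W, C) shape after each conv-conv-pool stage."""
    shapes = [(height, width, channels)]
    for filters in filters_per_stage:
        channels = filters                        # convolutions set the channel count
        height, width = height // 2, width // 2   # pooling halves each spatial axis
        shapes.append((height, width, channels))
    return shapes

shapes = trace_shapes(224, 224, 3, [64, 128, 256, 512])
for shape in shapes:
    print(shape)
# (224, 224, 3)
# (112, 112, 64)
# (56, 56, 128)
# (28, 28, 256)
# (14, 14, 512)
```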
The final layers of a classification CNN convert the 3D feature volume into a flat vector and pass it through one or more fully connected layers that output one raw score (logit) per class. A softmax function converts these scores into probabilities that sum to 1, allowing the network to express confidence across categories. For example, a network classifying animals might output 0.92 for "cat," 0.05 for "dog," and small values for all other categories.
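A minimal softmax sketch follows. The three labels and the score values are invented for the example; subtracting the maximum score before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # numerical stability; result is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

labels = ["cat", "dog", "bird"]        # hypothetical classes
scores = np.array([4.0, 1.0, 0.5])     # hypothetical outputs of the last layer
probs = softmax(scores)

prediction = labels[int(np.argmax(probs))]
print(prediction)  # cat
```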
Landmark Architectures
AlexNet (2012) had 8 layers, 60 million parameters, and won the ImageNet competition by a dramatic margin. It proved that deep CNNs trained on GPUs could vastly outperform hand-engineered features. VGGNet (2014) showed that using many small 3x3 filters was more effective than fewer large filters, reaching 19 layers with 144 million parameters. GoogLeNet/Inception (2014) introduced "inception modules" that applied multiple filter sizes in parallel at each layer, achieving higher accuracy with far fewer parameters than VGG.
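One way to see why stacks of 3x3 filters are attractive is to compare them with a single large filter that covers the same receptive field. The sketch below counts weights per input-output channel pair, ignores biases, and assumes stride 1 throughout; the function name is hypothetical.

```python
def stacked_3x3(num_layers):
    """Receptive field and weight count for a stack of 3x3, stride-1 convolutions."""
    receptive_field = 1 + 2 * num_layers   # each layer extends the field by 2 pixels
    weights = 9 * num_layers               # 9 weights per layer, per channel pair
    return receptive_field, weights

comparison = {}
for n in (2, 3):
    field, small_weights = stacked_3x3(n)
    large_weights = field * field          # one filter covering the same field
    comparison[n] = (field, small_weights, large_weights)

print(comparison)  # {2: (5, 18, 25), 3: (7, 27, 49)}
```

The stacked version also inserts a nonlinearity between layers, which a single large filter does not.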
ResNet (2015) was arguably the most influential architectural innovation of this period. It introduced skip connections (also called residual connections) that add the input of a block directly to its output. Mathematically, instead of learning the desired mapping H(x) directly, each block learns the residual F(x) = H(x) - x, and its output is F(x) + x. This seemingly small change allowed networks to be trained with 50, 101, or even 152 layers, whereas plain networks without skip connections tended to degrade in accuracy beyond roughly 20 layers. An ensemble of ResNets achieved a top-5 error rate of 3.57% on ImageNet, below the roughly 5% error commonly cited as an estimate of human performance on that benchmark.
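A residual block's output can be written as F(x) + x, where F is the transformation the block's layers learn. The sketch below uses two small dense layers as a stand-in for F, which is an assumption made to keep the example short (real ResNet blocks use convolutions and normalization); the function and weight names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Return F(x) + x, with F(x) = relu(x @ w1) @ w2 standing in for the block."""
    fx = relu(x @ w1) @ w2   # the learned residual
    return fx + x            # skip connection adds the input back

x = np.array([1.0, -2.0, 3.0])

# With all-zero weights the residual is zero, so the block is an identity map.
zeros = np.zeros((3, 3))
identity_out = residual_block(x, zeros, zeros)
print(identity_out)  # [ 1. -2.  3.]
```

This shows why depth becomes less risky: a block that has nothing useful to add can drive F(x) toward zero and pass its input through unchanged.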
EfficientNet (2019) used neural architecture search to design a baseline network and then scaled its width, depth, and input resolution together with a single compound coefficient, achieving state-of-the-art accuracy at the time with fewer parameters than previous architectures. ConvNeXt (2022) modernized the classic CNN design by incorporating ideas from transformers, such as larger kernel sizes and layer normalization, demonstrating that CNNs can match Vision Transformers when trained with equivalent techniques.
What CNNs Learn to See
Visualizing what each layer has learned reveals the hierarchical nature of deep feature extraction. In a network trained on natural images, the first convolutional layer's filters closely resemble Gabor filters: oriented edge detectors at various angles, along with color-opponent detectors. These are strikingly similar to the receptive fields of neurons in the primary visual cortex (V1) of biological visual systems.
Middle layers detect textures, repeated patterns, and object parts. You can see filters that respond to fur-like textures, grid patterns, circular shapes, or specific color combinations. These features are not tied to any particular object but are building blocks that many object categories share. A circular shape detector is useful for wheels, eyes, oranges, and coins.
The final convolutional layers contain neurons that respond to complete objects or large object parts. Some neurons fire specifically for dog faces, others for car wheels, others for text. These high-level feature maps are what the classification layers use to make their predictions. The network has automatically learned a complete hierarchy of visual features, from the simplest edges to complete objects, all from labeled examples.
Beyond Image Classification
While CNNs were originally developed for classification, they have been adapted for many other tasks. Object detection networks like YOLO and Faster R-CNN use CNNs to simultaneously classify and localize multiple objects in an image, drawing bounding boxes around each detected object. Semantic segmentation networks like U-Net assign a class label to every pixel, which is critical for medical imaging and autonomous driving. Instance segmentation networks like Mask R-CNN distinguish between individual instances of the same class, separating overlapping objects.
CNNs also process non-image data that has grid structure. 1D convolutions work on time series and audio waveforms. 3D convolutions process video (2D spatial plus time) and volumetric data like CT scans and MRI volumes. The same principles apply: local filters detect patterns, depth builds abstraction, and pooling provides spatial robustness.
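The 1D case is the easiest to show. The sketch below slides a two-value filter over a toy signal using `np.correlate`, which (like convolution layers in deep learning libraries) does not flip the kernel. The signal and the filter values are invented for illustration.

```python
import numpy as np

# Hypothetical time series with one rising and one falling step.
signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])

# Filter that computes (next value - current value): a step detector.
kernel = np.array([-1.0, 1.0])

response = np.correlate(signal, kernel, mode="valid")
print(response)  # [ 0.  1.  0.  0. -1.  0.]
```

The positive peak marks the rising step and the negative peak the falling one, the 1D analogue of an edge detector.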
CNNs learn hierarchical visual features automatically through stacked convolutional layers, building from edges to textures to complete object representations. Their design principles (local connectivity, weight sharing, and spatial pooling) make them both computationally efficient and effective at capturing the spatial structure that matters in images.