How Neural Networks Process Images
Images as Numbers
A digital image is a grid of numbers. A grayscale image of 224x224 pixels is a 224x224 matrix where each value ranges from 0 (black) to 255 (white). A color image has three such matrices (channels), one for red, green, and blue. A 224x224 RGB image is a 224x224x3 tensor containing 150,528 individual numbers. This is what the neural network receives as input: nothing visual, just numbers arranged in a spatial grid.
Before entering the network, pixel values are typically normalized to the range 0 to 1 (divide by 255) or standardized to zero mean and unit variance (subtract the dataset mean and divide by the dataset standard deviation, usually per channel). Keeping inputs on a small, consistent scale matches the assumptions behind common weight-initialization schemes and learning rates, which tends to make training more stable.
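The two preprocessing options can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the tiny 2x3 image and the DATASET_MEAN and DATASET_STD values are made-up assumptions, and real code would normally use an array library and per-channel statistics.

```python
def scale_to_unit(pixels):
    """Map 8-bit pixel values (0..255) to floats in the range 0..1."""
    return [[v / 255.0 for v in row] for row in pixels]


def standardize(pixels, mean, std):
    """Subtract a dataset mean and divide by a dataset standard deviation."""
    return [[(v - mean) / std for v in row] for row in pixels]


# A tiny 2x3 grayscale "image" standing in for a real 224x224 one.
image = [[0, 128, 255],
         [64, 192, 32]]

unit = scale_to_unit(image)

# Hypothetical dataset statistics, expressed on the 0..1 scale.
DATASET_MEAN = 0.5
DATASET_STD = 0.25
standardized = standardize(unit, DATASET_MEAN, DATASET_STD)

# Size of a 224x224 RGB input: height x width x channels.
num_values = 224 * 224 * 3
print(num_values)
```

With these assumed statistics, a black pixel maps to -2.0 and a white pixel to 2.0, so the inputs end up roughly centered on zero.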
The First Convolutional Layer: Edge Detection
The first convolutional layer applies a set of small filters (typically 64 filters of size 3x3 or 7x7) to the input image. Each filter slides across every position, computing a dot product between its weights and the local image patch. The output is a feature map showing where in the image the filter's pattern was detected.
Visualizing first-layer filters shows that they tend to learn edge detectors: horizontal edges, vertical edges, diagonal edges at various angles, and color-contrast boundaries. This pattern appears consistently across CNNs trained on natural images. The network rediscovers classical edge detection from scratch, plausibly because edges are among the most informative local features for distinguishing between different objects and scenes.
A 3x3 filter for detecting horizontal edges might have weights like: top row [1, 1, 1], middle row [0, 0, 0], bottom row [-1, -1, -1]. This filter produces a large positive response where there is a transition from bright (top) to dark (bottom), a large negative response for the opposite transition, and zero response in uniform regions. The actual learned filters are not this clean, but many of them resemble noisier versions of these patterns.
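To make the sliding dot product concrete, here is a small sketch that applies the hand-written horizontal-edge filter to a toy image. The function name correlate2d and the 6x4 test image are illustrative assumptions. The code uses no padding and a stride of 1, and handles a single channel only.

```python
def correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the
    dot product between the kernel and each local patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out


HORIZONTAL_EDGE = [[1, 1, 1],
                   [0, 0, 0],
                   [-1, -1, -1]]

# Toy image: three bright rows (1) above three dark rows (0).
bright_over_dark = [[1, 1, 1, 1],
                    [1, 1, 1, 1],
                    [1, 1, 1, 1],
                    [0, 0, 0, 0],
                    [0, 0, 0, 0],
                    [0, 0, 0, 0]]

feature_map = correlate2d(bright_over_dark, HORIZONTAL_EDGE)
for row in feature_map:
    print(row)

# The same image flipped vertically gives the opposite sign at the edge.
flipped_map = correlate2d(bright_over_dark[::-1], HORIZONTAL_EDGE)
```

The response is zero where the patch is uniform and 3 where the patch straddles the bright-to-dark boundary. The flipped image gives -3 at the boundary instead.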
Building Complexity Through Depth
Each subsequent convolutional layer receives the feature maps from the previous layer as input and produces new, more complex feature maps. The second layer's filters operate on edge maps rather than raw pixels, so they detect combinations of edges: corners (two edges meeting at an angle), curves (edges at gradually changing orientations), and simple textures (repeating edge patterns).
By roughly the third and fourth layers, features represent recognizable textures and simple object parts. Filters might respond to fur patterns, metallic surfaces, fabric textures, or circular shapes. Deeper still, around layers five through eight, features represent complete object parts: eyes, wheels, windows, leaves. The final convolutional layers produce features that represent entire objects or object categories. The exact depth at which each kind of feature appears depends on the architecture.
This hierarchy is not explicitly programmed. It emerges from the training objective (minimize classification error) and the architectural constraints (small local filters, pooling for dimensionality reduction). Under those constraints, the network converges on hierarchical feature composition as an effective strategy for visual recognition, mirroring an organizational principle that neuroscience identified in the biological visual cortex decades earlier.
Pooling and Spatial Reduction
Pooling layers reduce the spatial dimensions of feature maps between convolutional blocks. Max pooling with a 2x2 window and stride 2 reduces each feature map to half its width and height. For each 2x2 region, only the maximum value is kept, representing the strongest detection of that feature in that area.
Pooling serves two purposes. First, it reduces computational cost: halving width and height reduces the number of values by a factor of 4, making subsequent layers faster. Second, it provides a degree of translation invariance. After max pooling, the exact pixel position of a feature matters less; what matters is whether it was present somewhere in the local region. This helps the network recognize a cat whether the cat is shifted slightly left or right.
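Both effects can be seen in a short sketch of 2x2 max pooling with stride 2. The function name and the toy feature maps are illustrative assumptions, and the code assumes the height and width are even.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            row.append(max(feature_map[i][j], feature_map[i][j + 1],
                           feature_map[i + 1][j], feature_map[i + 1][j + 1]))
        pooled.append(row)
    return pooled


# A 4x4 feature map shrinks to 2x2: a 4x reduction in values.
fm = [[1, 3, 2, 0],
      [4, 2, 1, 1],
      [0, 0, 5, 6],
      [1, 2, 7, 8]]
pooled = max_pool_2x2(fm)
print(pooled)

# The same strong detection (9) at two nearby positions inside one window.
detection_a = [[9, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
detection_b = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
same_after_pooling = max_pool_2x2(detection_a) == max_pool_2x2(detection_b)
```

The invariance is only partial. A detection that moves into a neighboring 2x2 window changes the pooled output, which is why pooling tolerates small shifts rather than arbitrary ones.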
Modern architectures often replace pooling with strided convolutions (convolutions with stride 2 that learn what information to keep and what to discard), which gives the network learnable downsampling rather than fixed operations.
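The downsampling effect of a stride-2 layer follows from the standard output-size formula, floor((n + 2p - k) / s) + 1. The sketch below applies it to a 224-pixel dimension. The helper name and the padding values are illustrative assumptions, chosen as typical settings rather than taken from any specific architecture.

```python
def conv_output_size(n, kernel, stride, padding):
    """Output length along one dimension for a convolution or pooling layer:
    floor((n + 2 * padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


# 2x2 max pooling, stride 2, no padding.
pooled_size = conv_output_size(224, kernel=2, stride=2, padding=0)

# 3x3 strided convolution, stride 2, padding 1 (assumed setting).
strided_size = conv_output_size(224, kernel=3, stride=2, padding=1)

# 3x3 convolution, stride 1, padding 1 keeps the size unchanged.
same_size = conv_output_size(224, kernel=3, stride=1, padding=1)

print(pooled_size, strided_size, same_size)
```

Both downsampling options halve the dimension. The difference is that the strided convolution's weights are learned, whereas max pooling applies a fixed rule.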
The Classification Head
After the final convolutional block, the feature maps are converted to a flat vector and passed through one or more dense layers that produce the final classification. In modern architectures, global average pooling (averaging each feature map to a single value) is the most common way to do this conversion, producing a vector with one value per feature map (e.g., 2,048 values for ResNet-50). This vector is a compact representation of the entire image, encoding which high-level features are present and how strongly.
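Global average pooling is simple enough to write out directly. The sketch below uses three made-up 2x2 feature maps as stand-ins for the 2,048 maps mentioned above. The function name is an illustrative assumption.

```python
def global_average_pool(feature_maps):
    """Reduce each 2D feature map to its mean, giving one value per map."""
    vector = []
    for fmap in feature_maps:
        total = sum(sum(row) for row in fmap)
        count = len(fmap) * len(fmap[0])
        vector.append(total / count)
    return vector


feature_maps = [
    [[1, 2], [3, 4]],   # a feature detected moderately everywhere
    [[0, 0], [0, 8]],   # a feature detected strongly in one corner
    [[1, 1], [1, 1]],   # a feature detected weakly and uniformly
]

image_vector = global_average_pool(feature_maps)
print(image_vector)
```

The output length equals the number of feature maps regardless of their spatial size, which is why this step works for any input resolution.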
A final dense layer with softmax activation maps this representation to class probabilities. For ImageNet with 1,000 classes, the output is a 1,000-dimensional probability vector. The class with the highest probability is the network's prediction. If the highest probability is 0.92 for "golden retriever," the network assigns 92% of its probability mass to that class. This is often read as 92% confidence, although softmax scores are not guaranteed to be well calibrated.
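The softmax step can be sketched as follows. The three labels and the logit values are made up for illustration, standing in for the 1,000 ImageNet classes. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels and logits from the final dense layer.
labels = ["golden retriever", "labrador", "tabby cat"]
logits = [4.0, 1.5, 0.2]

probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

The prediction is simply the index of the largest probability. Since softmax preserves ordering, it is also the index of the largest logit.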
Vision Transformers: A Different Approach
Vision transformers (ViT) process images differently from CNNs. Instead of applying convolutional filters, ViT divides the image into fixed-size patches (typically 16x16 pixels), flattens each patch into a vector, projects it through a linear layer, adds a positional embedding, and processes the sequence of patch embeddings through standard transformer blocks.
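The patch-splitting step can be sketched in plain Python. The function name and the tiny 4x4 image are illustrative assumptions. The code assumes the image height and width are multiples of the patch size, and it leaves out the linear projection and positional embedding that follow in a real ViT.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches,
    in row-major order. Assumes H and W are multiples of patch_size."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for i in range(top, top + patch_size):
                for j in range(left, left + patch_size):
                    patch.extend(image[i][j])  # append this pixel's channels
            patches.append(patch)
    return patches


# Tiny 4x4 RGB image split into 2x2 patches.
image = [[[i, j, i + j] for j in range(4)] for i in range(4)]
patches = image_to_patches(image, patch_size=2)

# The same arithmetic at the scale described above.
num_patches = (224 // 16) ** 2   # patches per 224x224 image
patch_dim = 16 * 16 * 3          # values per flattened RGB patch
print(len(patches), len(patches[0]), num_patches, patch_dim)
```

A 224x224 RGB image therefore becomes a sequence of 196 vectors with 768 values each before projection.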
Self-attention lets each patch attend to every other patch, capturing global relationships from the first layer. A patch containing a dog's ear can attend to a distant patch containing the dog's tail, establishing a connection that a CNN's local receptive field would need many layers to capture. This global connectivity is both ViT's strength (capturing long-range dependencies) and its weakness (computational cost that grows quadratically with the number of patches, and therefore rapidly with image resolution).
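A toy version of self-attention shows where the quadratic cost comes from. This sketch makes a strong simplifying assumption: queries, keys, and values are the patch embeddings themselves, with no learned projections and a single head. The function names and the three 2-dimensional "patch embeddings" are made up for illustration.

```python
import math


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention with Q = K = V = tokens.
    Returns the attended outputs and the n x n attention weights."""
    dim = len(tokens[0])
    weights, outputs = [], []
    for query in tokens:
        # One score per (query, key) pair: n scores per query, n * n in total.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * value[c] for w, value in zip(row, tokens))
                        for c in range(dim)])
    return outputs, weights


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(patches)


def attention_scores(image_size, patch_size=16):
    """Number of pairwise scores for one head in one layer."""
    n = (image_size // patch_size) ** 2
    return n * n


print(attention_scores(224), attention_scores(448))
```

Doubling the image side from 224 to 448 quadruples the number of patches and multiplies the number of pairwise scores by 16.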
In practice, ViT requires more training data than CNNs to achieve comparable performance (because it has fewer built-in inductive biases, such as locality and translation equivariance), but with sufficient data and compute, ViT matches or exceeds CNN performance on many benchmarks. Hybrid architectures that combine convolutional stems (for efficient early feature extraction) with transformer blocks (for global reasoning) often achieve strong results.
What the Network Actually "Sees"
Feature visualization and attribution techniques reveal what aspects of an image drive the network's decisions. Gradient-weighted class activation maps (Grad-CAM) highlight which image regions contributed most to the classification. For a "golden retriever" classification, Grad-CAM typically highlights the dog's face and body, which suggests that the network is using the right visual evidence.
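The core of Grad-CAM is a weighted sum of feature maps. The sketch below covers only that combination step and assumes the gradients of the class score with respect to the last convolutional feature maps have already been computed by a deep learning framework. The function name and the toy activations and gradients are made up for illustration.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps into one heatmap.

    activations[k][i][j]: value of feature map k at position (i, j).
    gradients[k][i][j]: d(class score) / d(activations[k][i][j]),
    assumed to be precomputed elsewhere.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial average of that channel's gradients.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    return [[max(0.0, sum(w * act[i][j]
                          for w, act in zip(weights, activations)))
             for j in range(width)]
            for i in range(height)]


# Two toy 2x2 feature maps: one fires top-left, one fires bottom-right.
activations = [[[1, 0], [0, 0]],
               [[0, 0], [0, 1]]]
# The class score rises with the first map and falls with the second.
gradients = [[[0.5, 0.5], [0.5, 0.5]],
             [[-1.0, -1.0], [-1.0, -1.0]]]

heatmap = grad_cam(activations, gradients)
print(heatmap)
```

Only the top-left position survives: the second feature map counts as evidence against the class, so the ReLU removes it. In practice the coarse heatmap is then upsampled to the input resolution and overlaid on the image.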
However, networks sometimes rely on unexpected features. A classifier might use background cues (green grass for "cow," snow for "husky") rather than the object itself. Networks trained on hospital-specific X-ray datasets might recognize the hospital's equipment markings rather than the pathology. These shortcut features work on the training distribution but fail when the shortcuts are absent. Understanding what the network actually sees, rather than assuming it sees what a human would see, is essential for building trustworthy visual AI.
Summary
Neural networks process images by transforming raw pixel values through layers of learned filters that extract increasingly abstract features: edges, textures, parts, and complete objects. CNNs use local convolutional filters with weight sharing, while vision transformers use global attention across image patches. The final representation encodes which high-level features are present and feeds into a classification head. Understanding what the network actually uses for its decisions, through visualization tools like Grad-CAM, is critical for building reliable visual AI systems.