How Object Detection Works

Updated May 2026
Object detection is the computer vision task of identifying all objects of interest in an image and localizing each one with a bounding box. Unlike classification, which assigns a single label to the whole image, detection outputs a variable number of predictions, each consisting of a bounding box (x, y, width, height), a class label, and a confidence score. Modern detectors such as YOLOv8 can process images at over 100 frames per second on a modern GPU while detecting dozens of object categories simultaneously.
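To make the output format concrete, here is a minimal Python sketch of one prediction. The class name Detection, the field names, and the sample values are illustrative assumptions, and so is the convention that (x, y) is the top-left corner; some tools instead use (x, y) for the box center, so it is worth checking which one a given detector emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One prediction: a box, a class label, and a confidence score."""
    x: float       # left edge in pixels (assumed top-left convention)
    y: float       # top edge in pixels
    width: float
    height: float
    label: str
    score: float   # confidence in [0, 1]

    def corners(self):
        """Return the box as (x_min, y_min, x_max, y_max)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)


# A detector returns a variable-length list, possibly empty.
detections = [
    Detection(48.0, 30.0, 120.0, 80.0, "car", 0.92),
    Detection(200.0, 45.0, 40.0, 110.0, "person", 0.81),
]

for det in detections:
    print(det.label, det.score, det.corners())
```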

The Detection Problem

Object detection is substantially harder than image classification because the model must solve two problems simultaneously: recognizing what objects are present and determining where each one is located. An image might contain zero objects or fifty objects. They might be large or tiny, centered or in the corners, fully visible or partially hidden behind other objects. The model cannot know in advance how many predictions to make, and each prediction requires not just a class label but precise spatial coordinates.

Before deep learning, the dominant approach was the sliding window detector. A classification model would be applied to every possible rectangular region of the image at every possible scale, producing a classification score for each region. If the region scored above a threshold for any object class, it was flagged as a detection. This brute-force approach was computationally expensive because the number of possible rectangles in an image is astronomical. A 1000x1000 pixel image has roughly 250 billion possible rectangles. Even with tricks to prune the search space, sliding window detectors were slow and the rectangular window was a poor match for objects with unusual shapes or aspect ratios.
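The 250 billion figure can be checked with a short calculation, assuming rectangles are axis-aligned with edges on pixel boundaries. A W x H image has W + 1 vertical and H + 1 horizontal grid lines, and each rectangle corresponds to a choice of two distinct lines in each direction. The function name below is illustrative.

```python
from math import comb


def count_rectangles(width, height):
    """Count axis-aligned rectangles whose edges lie on pixel boundaries."""
    # Pick 2 of the (width + 1) vertical lines and 2 of the (height + 1)
    # horizontal lines; each such choice defines exactly one rectangle.
    return comb(width + 1, 2) * comb(height + 1, 2)


total = count_rectangles(1000, 1000)
print(f"{total:,}")  # 250,500,250,000
```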

The Viola-Jones face detector (2001) made sliding windows practical for a single object class by using a cascade of increasingly complex classifiers. The first classifier in the cascade was extremely simple and fast, rejecting obvious non-face regions immediately. Only regions that passed the first classifier were evaluated by the second, more complex classifier. Only regions passing the second were sent to the third, and so on. This cascade structure meant that the computationally expensive classifiers were only applied to a small fraction of windows, enabling real-time face detection on 2001 hardware. But this approach only worked for a single predefined object class and required extensive engineering for each new category.
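The early-rejection idea can be sketched in a few lines. This is not the Viola-Jones feature set: the three stage tests below are hypothetical stand-ins on toy windows of four intensity values, chosen only to show how few windows reach the later stages.

```python
def run_cascade(window, stages):
    """Accept a window only if every stage accepts it; stop at the first rejection."""
    for stage in stages:
        if not stage(window):
            return False
    return True


def counting_stage(name, predicate, counts):
    """Wrap a predicate so we can see how often each stage is evaluated."""
    def stage(window):
        counts[name] = counts.get(name, 0) + 1
        return predicate(window)
    return stage


counts = {}
stages = [
    # Hypothetical tests, ordered from cheapest to most expensive.
    counting_stage("cheap", lambda w: sum(w) / len(w) > 50, counts),
    counting_stage("medium", lambda w: max(w) - min(w) > 30, counts),
    counting_stage("expensive", lambda w: w[0] < w[len(w) // 2], counts),
]

windows = [
    [0, 5, 3, 2],          # too dark: rejected by the first stage
    [100, 100, 100, 100],  # no contrast: rejected by the second stage
    [60, 90, 120, 80],     # passes all three stages
    [10, 20, 10, 5],       # too dark: rejected by the first stage
]

results = [run_cascade(w, stages) for w in windows]
print(results)  # [False, False, True, False]
print(counts)   # {'cheap': 4, 'medium': 2, 'expensive': 1}
```

In this toy run the most expensive stage is evaluated on only one of the four windows, which is the effect the cascade relies on at scale.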

Two-Stage Detectors

The R-CNN (Regions with Convolutional Neural Networks) family introduced the two-stage paradigm that dominated detection from 2014 to 2017. The first stage generates region proposals: rectangular regions that likely contain objects, regardless of what kind of object. The second stage classifies each proposal and refines its bounding box. R-CNN (2014) used selective search to generate about 2,000 region proposals per image, then ran a CNN independently on each proposal. This worked well but was painfully slow, taking about 47 seconds per image because each proposal required a separate forward pass through the CNN.

Fast R-CNN (2015) eliminated this redundancy by running the CNN once on the entire image to produce a shared feature map, then extracting features for each proposal from this shared map using a technique called RoI (Region of Interest) pooling. This reduced per-image processing time from 47 seconds to about 0.3 seconds, not counting the time spent generating proposals. Faster R-CNN (2015) replaced the external selective search algorithm with a Region Proposal Network (RPN), a small CNN that shares features with the detector and learns to propose regions end-to-end. This removed region proposal generation as the remaining computational bottleneck and enabled fully end-to-end training in which the proposal and classification stages are optimized jointly.
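RoI pooling can be illustrated on a single-channel feature map. The sketch below is a simplified version, assuming the region is already given in integer feature-map cells and using max pooling over a fixed grid of bins; the function names and the toy feature map are illustrative, and real implementations work on multi-channel tensors.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an arbitrary region into a fixed out_h x out_w grid.

    feature_map: list of rows (one channel).
    roi: (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Floor the bin start and ceil the bin end so no bin is empty.
        ys = y0 + (i * roi_h) // out_h
        ye = y0 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = x0 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 feature map whose value encodes its position (row * 10 + col).
fmap = [[r * 10 + c for c in range(6)] for r in range(4)]

# Two proposals of different sizes both produce a 2 x 2 output.
print(roi_max_pool(fmap, (1, 0, 5, 4), 2, 2))  # [[12, 14], [32, 34]]
print(roi_max_pool(fmap, (0, 0, 6, 3), 2, 2))  # [[12, 15], [22, 25]]
```

The point of the fixed output size is that proposals of any shape can be fed to the same fully connected layers after a single shared CNN pass.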

Faster R-CNN introduced the concept of anchor boxes: predefined bounding boxes at each position on the feature map with different aspect ratios and scales. The RPN predicts, for each anchor, whether it contains an object (objectness score) and how to adjust its coordinates to better fit the actual object (bounding box regression). Typical configurations use 9 anchors per position (3 aspect ratios x 3 scales), generating roughly 20,000 anchor boxes across the feature map. The RPN scores and filters these to produce 300 to 2,000 high-quality proposals that are passed to the second stage for classification and further refinement.
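The anchor layout can be sketched as follows. The specific scales (128, 256, 512 pixels), aspect ratios (0.5, 1, 2), stride of 16, and the 60 x 40 feature map are assumptions taken from a commonly cited Faster R-CNN configuration rather than from the text above, and the function names are illustrative.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the anchors centered at (cx, cy) as (x0, y0, x1, y1) boxes."""
    boxes = []
    for s in scales:
        for r in ratios:
            # Keep the area at s * s while setting height / width = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def all_anchors(fm_w, fm_h, stride=16):
    """Place the same set of anchors at every feature-map position."""
    anchors = []
    for j in range(fm_h):
        for i in range(fm_w):
            # Map the feature-map cell back to its center in image pixels.
            cx = (i + 0.5) * stride
            cy = (j + 0.5) * stride
            anchors.extend(anchors_at(cx, cy))
    return anchors


print(len(anchors_at(0, 0)))      # 9
print(len(all_anchors(60, 40)))   # 21600
```

With a 60 x 40 feature map, the count comes out at 21,600, in line with the rough figure of 20,000 anchors per image.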

Single-Stage Detectors

YOLO (You Only Look Once, 2016) took a radically different approach by framing detection as a single regression problem. Instead of generating proposals and then classifying them, YOLO divides the image into a grid (for example, 7x7 cells) and predicts bounding boxes and class probabilities directly for each cell. Each cell predicts a fixed number of bounding boxes along with confidence scores. The network processes the entire image in a single forward pass, outputting all detections simultaneously. This made YOLO dramatically faster than two-stage detectors: the original YOLO ran at 45 frames per second, and a smaller variant (Fast YOLO) reached 155 frames per second, while Faster R-CNN managed 5 to 17 frames per second.
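A short sketch shows how the grid formulation fixes the size of the network output and decides which cell owns an object. The 7x7 grid comes from the text; the values B = 2 boxes per cell and C = 20 classes are assumptions based on the original YOLO setup for PASCAL VOC, and the function names are illustrative.

```python
def output_size(S=7, B=2, C=20):
    """Length of the flat prediction vector for an S x S grid.

    Each cell predicts B boxes with 5 numbers each (x, y, w, h, confidence)
    plus one set of C class probabilities shared by the cell.
    """
    return S * S * (B * 5 + C)


def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell containing the object center."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col


print(output_size())                          # 1470
print(responsible_cell(224, 224, 448, 448))   # (3, 3)
print(responsible_cell(30, 400, 448, 448))    # (6, 0)
```

Because each cell has only one set of class probabilities, two small objects whose centers fall in the same cell compete for the same prediction slot.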

The original YOLO had limitations: its coarse grid made it struggle with small objects and groups of small objects within the same grid cell. YOLOv2 (2017) addressed these issues with anchor boxes, batch normalization, and multi-scale training. YOLOv3 (2018) added predictions at three different scales, improving small object detection substantially. The YOLO lineage continued through YOLOv4, YOLOv5, and beyond, with each version incorporating architectural improvements from the broader research community. YOLOv8 (2023) uses an anchor-free design, decoupled detection heads for classification and localization, and modern training recipes that achieve accuracy competitive with state-of-the-art two-stage detectors while maintaining real-time speed.

SSD (Single Shot MultiBox Detector, 2016) took a similar approach to YOLO but made predictions from multiple feature maps at different resolutions. Earlier feature maps with higher spatial resolution detected small objects, while later feature maps with lower resolution but larger receptive fields detected large objects. This multi-scale prediction strategy addressed the small object problem more elegantly than early YOLO versions. RetinaNet (2017) introduced focal loss, which down-weights the loss contribution from easy examples (typically background) and focuses training on hard examples (objects that are difficult to detect), solving the class imbalance problem that made single-stage detectors less accurate than two-stage detectors.
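The effect of focal loss is easiest to see next to ordinary cross-entropy. The sketch below handles a single binary prediction and uses gamma = 2 and alpha = 0.25, which are the defaults reported for RetinaNet and are assumptions as far as the text above is concerned; the function names are illustrative.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The factor (1 - p_t) ** gamma shrinks toward 0 for confident, correct
    # predictions, so easy examples contribute almost nothing to the loss.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_background = (0.01, 0)  # model is already confident this is background
hard_object = (0.10, 1)      # model is missing a real object

for name, (p, y) in [("easy", easy_background), ("hard", hard_object)]:
    print(name, cross_entropy(p, y), focal_loss(p, y))
```

The easy background example is down-weighted by several orders of magnitude relative to cross-entropy, while the hard object keeps a substantial share of its loss, so thousands of easy background locations no longer swamp the gradient.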

Non-Maximum Suppression and Post-Processing

Both detection paradigms produce many overlapping predictions for each object. A single car in an image might generate 20 to 50 overlapping bounding boxes, all correctly identifying it as a car but with slightly different positions and sizes. Non-maximum suppression (NMS) filters these redundant predictions by keeping only the highest-confidence box for each object. The algorithm sorts all predictions by confidence, takes the top prediction, removes all other predictions that overlap with it beyond a threshold (typically 50% Intersection over Union), then repeats with the next highest remaining prediction until all predictions are either kept or removed.
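The greedy procedure described above can be written directly. This is a minimal single-class sketch, assuming boxes in (x0, y0, x1, y1) corner format; in practice NMS is commonly run separately per class, and the function names and sample boxes here are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Visit candidates from highest to lowest confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [
    (10, 10, 110, 110),    # car, best box
    (12, 12, 112, 112),    # near-duplicate of the first box
    (15, 8, 108, 105),     # another near-duplicate
    (200, 200, 260, 280),  # a different object
]
scores = [0.90, 0.85, 0.70, 0.80]

print(nms(boxes, scores))  # [0, 3]
```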

Intersection over Union (IoU) is the standard metric for measuring how well a predicted bounding box matches a ground truth box. IoU equals the area of overlap between the two boxes divided by the area of their union. An IoU of 1.0 means perfect overlap. An IoU of 0 means no overlap. The standard threshold for counting a prediction as correct is 0.5 IoU, meaning the intersection must be at least half the area of the union. This is stricter than it sounds: a predicted box of the same size as the ground truth must cover about two-thirds of it to reach 0.5. More stringent evaluations use higher thresholds (0.75 or 0.95), and COCO evaluation averages performance across IoU thresholds from 0.5 to 0.95 in steps of 0.05.
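A small worked example helps calibrate what an IoU of 0.5 means. The sketch assumes boxes in (x, y, width, height) format with (x, y) as the top-left corner, and the function name is illustrative. Two equal-sized boxes that each have two-thirds of their area inside the other score exactly 0.5.

```python
def iou_xywh(a, b):
    """IoU of two boxes given as (x, y, width, height), top-left origin."""
    ax0, ay0, ax1, ay1 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx0, by0, bx1, by1 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    # Width and height of the intersection, clamped at zero when disjoint.
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 30, 10)
prediction = (10, 0, 30, 10)  # same size, shifted right by a third of the width

# Intersection is 20 * 10 = 200 and union is 300 + 300 - 200 = 400.
print(iou_xywh(ground_truth, prediction))  # 0.5

# The ten IoU thresholds used by COCO-style averaging: 0.5, 0.55, ..., 0.95.
coco_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(coco_thresholds)
```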

Mean Average Precision (mAP) is the primary metric for evaluating object detectors. For each object class, the detector's predictions are ranked by confidence and precision-recall values are computed at each rank. The area under this precision-recall curve gives the Average Precision (AP) for that class. mAP averages AP across all object classes. COCO's primary metric, mAP@[.5:.95], averages across both classes and IoU thresholds, providing a comprehensive measure of detection quality. State-of-the-art detectors achieve mAP above 55% on COCO, which contains 80 diverse object categories in complex scenes.
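The AP calculation for one class can be sketched as below. It assumes each detection has already been marked as a true or false positive at a chosen IoU threshold, that the class has at least one ground truth object, and that the area is computed with all-point interpolation (a monotone precision envelope). COCO's official tooling samples the curve at fixed recall points instead, so its numbers differ slightly. Function names and sample data are illustrative.

```python
def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class (num_gt > 0)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from left to right (the envelope).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas: recall step times envelope precision.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the plain mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Four detections for a class with four ground truth objects (one is missed).
ap_car = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4)
print(ap_car)  # 0.625
print(mean_average_precision({"car": ap_car, "person": 1.0}))  # 0.8125
```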

Anchor-Free and Transformer-Based Detection

Recent detection research has moved toward anchor-free designs that eliminate predefined anchor boxes. Instead of predicting offsets relative to anchors, these models predict object centers and dimensions directly. CenterNet (2019) detects objects by finding their center points as peaks in a heatmap, then regressing the object's width and height from the center point. FCOS (Fully Convolutional One-Stage, 2019) predicts the distances from each feature map location to the four sides of the bounding box. These approaches simplify the training pipeline by removing hyperparameters related to anchor design (number of scales, aspect ratios, sizes) that required careful tuning for each dataset.
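Both decoding rules are simple enough to sketch. The first function turns FCOS-style distances to the four sides into a corner-format box; the second finds CenterNet-style peaks as heatmap values that are at least as large as all of their 3x3 neighbors. The function names, the threshold of 0.3, and the toy heatmap are illustrative assumptions, and real models operate on per-class tensors at a reduced output stride.

```python
def fcos_decode(px, py, left, top, right, bottom):
    """Box (x0, y0, x1, y1) from a location and its distances to the four sides."""
    return (px - left, py - top, px + right, py + bottom)


def heatmap_peaks(heatmap, threshold=0.3):
    """Return (x, y, value) for local maxima of a 2D heatmap above threshold."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            # 3x3 neighborhood, clipped at the borders (includes the cell itself).
            neighborhood = [heatmap[ny][nx]
                            for ny in range(max(0, y - 1), min(h, y + 2))
                            for nx in range(max(0, x - 1), min(w, x + 2))]
            if v >= max(neighborhood):
                peaks.append((x, y, v))
    return peaks


print(fcos_decode(50, 40, 10, 5, 30, 25))  # (40, 35, 80, 65)

heatmap = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.4, 0.7],
    [0.0, 0.0, 0.0, 0.3, 0.5],
]
print(heatmap_peaks(heatmap))  # [(1, 1, 0.9), (4, 2, 0.7)]
```

Neither rule refers to a predefined box shape, which is why the anchor-related hyperparameters disappear.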

DETR (Detection Transformer, 2020) applied the transformer architecture to detection, treating it as a set prediction problem. DETR uses a CNN backbone to extract image features, a transformer encoder-decoder to process them, and a bipartite matching loss that associates each ground truth object with exactly one prediction. This eliminates the need for NMS, anchor boxes, and many other hand-designed components of traditional detectors. The simplified architecture is elegant, but the original DETR converged slowly during training and struggled with small objects. Deformable DETR, DINO, and subsequent variants addressed these limitations, making transformer-based detection competitive with the best CNN-based approaches.
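The one-to-one matching step can be illustrated with a tiny brute-force search. This sketch assumes a precomputed cost matrix with at least one ground truth object and at least as many predictions as ground truths; the cost values are made up, and the exhaustive search is only practical for toy sizes. Real implementations typically use the Hungarian algorithm, for example scipy.optimize.linear_sum_assignment.

```python
from itertools import permutations


def bipartite_match(cost):
    """Find the minimum-cost one-to-one assignment by exhaustive search.

    cost[g][p] is the cost of assigning ground truth g to prediction p.
    Returns (assignment, total), where assignment[g] is the matched prediction.
    """
    n_gt, n_pred = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Each permutation picks a distinct prediction for every ground truth.
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[g][perm[g]] for g in range(n_gt))
        if total < best_total:
            best_total, best_assignment = total, perm
    return list(best_assignment), best_total


# Two ground truth objects, three predictions (hypothetical costs).
cost = [
    [0.2, 0.9, 0.3],
    [0.1, 0.8, 0.7],
]
assignment, total = bipartite_match(cost)
print(assignment)  # [2, 0]
```

A greedy choice would give the first object its cheapest prediction (index 0) and force the second object onto a poor match. The global optimum instead assigns prediction 2 to the first object and prediction 0 to the second. The unmatched prediction (index 1) is the one that would be trained to output "no object", which is what removes the need for NMS.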

Detection in the Real World

Autonomous driving is among the most demanding applications of object detection. A self-driving vehicle must detect cars, trucks, motorcycles, bicycles, pedestrians, traffic signs, traffic lights, lane markings, construction zones, and unexpected obstacles in real time. The detector must handle extreme variation in lighting (tunnels, direct sun, nighttime headlights), weather (rain, snow, fog), and occlusion (a pedestrian partially hidden behind a parked car). Missing a single pedestrian detection can be fatal, making this one of the most safety-critical detection applications. Production autonomous driving systems use multiple detection models, redundant sensor modalities (camera, lidar, radar), and extensive validation against very large volumes of real and simulated driving data.

Retail inventory management uses overhead cameras with object detectors to monitor shelf stock levels in real time. The system detects which products are present, which shelf positions are empty, and whether products are in their correct locations. This replaces manual shelf audits, which are labor-intensive and infrequent. Warehouse logistics uses detection to locate packages, verify labels, and guide robotic picking systems. Amazon's fulfillment centers process millions of items daily using vision-guided robots that detect, grasp, and sort packages at speeds exceeding those of human workers.

Wildlife monitoring deploys camera traps in natural habitats that capture millions of images. Object detectors identify which species are present in each image, their count, and their behavior, replacing the manual review process that previously required thousands of volunteer hours. The Snapshot Serengeti project, for example, collected over 7 million images from 225 camera traps in Tanzania. Training detectors on these datasets enables continuous, scalable monitoring of wildlife populations, migration patterns, and ecosystem health without disturbing the animals.

Key Takeaway

Object detection finds and localizes every object in an image using either two-stage detectors (higher accuracy) or single-stage detectors (real-time speed), with modern architectures like YOLOv8 and DETR closing the gap between the two approaches.