3D Computer Vision

Updated May 2026
3D computer vision recovers the three-dimensional structure of the physical world from visual input, most often two-dimensional images. It encompasses techniques for estimating depth from images, reconstructing 3D shapes from photographs, processing point cloud data from lidar sensors, and synthesizing novel views of scenes that were never directly photographed. These capabilities enable augmented reality, autonomous navigation, robotic manipulation, architectural modeling, and digital twin creation.

The Fundamental Challenge: From 2D to 3D

A photograph is a projection of a 3D world onto a 2D sensor. This projection loses depth information: every point along a ray from the camera through a pixel projects to the same location in the image. A basketball held close to the camera and a building far away can appear the same size in the image. Recovering the lost depth dimension from one or more 2D images is the central challenge of 3D computer vision. Different techniques approach this challenge using different sources of information: stereo correspondence between two views, motion parallax between frames of a video, learned monocular depth cues, or active sensors that directly measure distance.

The mathematical foundation of 3D vision is projective geometry, which describes how 3D points map to 2D image coordinates through the camera's optical system. A pinhole camera model defines this mapping with a 3x4 projection matrix that encodes the camera's focal length, principal point, and its position and orientation in the world. Camera calibration determines these parameters from observations of known geometric patterns (like a checkerboard). Once calibrated, the camera geometry enables triangulation: given a point visible in two calibrated cameras, its 3D position can be computed by finding where the two viewing rays intersect or, because measurement noise means the rays rarely meet exactly, the point that lies closest to both. The accuracy of 3D reconstruction depends critically on the accuracy of camera calibration.
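The projection and triangulation steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a calibration pipeline: the focal length, principal point, and 0.2 m camera offset are made-up values, the function names are hypothetical, and the triangulation is the basic linear (DLT) method on noise-free pixels.

```python
import numpy as np

def projection_matrix(f, cx, cy, R, t):
    """Build the 3x4 pinhole projection matrix P = K [R | t]."""
    K = np.array([[f, 0.0, cx],
                  [0.0, f, cy],
                  [0.0, 0.0, 1.0]])
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(P, X):
    """Project a 3D world point X to pixel coordinates (u, v)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one point seen in two calibrated views."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector
    # belonging to the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]

# Assumed setup: two identical cameras, the second centered 0.2 m along +x,
# so its translation is t = -R @ C = (-0.2, 0, 0).
P1 = projection_matrix(800.0, 320.0, 240.0, np.eye(3), np.zeros(3))
P2 = projection_matrix(800.0, 320.0, 240.0, np.eye(3), np.array([-0.2, 0.0, 0.0]))

X_true = np.array([0.5, -0.1, 4.0])

# Depth ambiguity: a point twice as far along the same ray hits the same pixel.
same_pixel = np.allclose(project(P1, X_true), project(P1, 2.0 * X_true))

# A second view resolves the ambiguity.
X_est = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
```

With real, noisy correspondences the recovered point would only approximate the true one, which is why calibration accuracy matters so much.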

Stereo Vision and Depth Estimation

Stereo vision uses two cameras separated by a known baseline distance (like human eyes, which are separated by about 63 millimeters) to estimate depth through triangulation. For each pixel in the left image, the system searches for the corresponding pixel in the right image. The displacement between these corresponding pixels, called disparity, is inversely proportional to depth: nearby objects have large disparities (they shift more between views) while distant objects have small disparities. Converting a dense disparity map to a depth map requires only the camera baseline and focal length.
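For a rectified stereo pair, the inverse relationship between disparity and depth is Z = f * B / d, with the focal length f in pixels, the baseline B in meters, and the disparity d in pixels. A small sketch, using an assumed 800-pixel focal length and 12 cm baseline:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters of a point in a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive; zero corresponds to a point at infinity")
    return focal_px * baseline_m / disparity_px

# Illustrative values only: large disparity means near, small disparity means far.
near = depth_from_disparity(64.0, focal_px=800.0, baseline_m=0.12)  # about 1.5 m
far = depth_from_disparity(4.0, focal_px=800.0, baseline_m=0.12)    # about 24 m
```

Because depth varies as 1/d, a one-pixel disparity error matters far more for distant points than for nearby ones.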

Stereo matching, finding the correct correspondence between left and right image pixels, is the hard part. Classical algorithms like Semi-Global Matching (SGM) compute matching costs for each possible disparity at each pixel, then approximate a globally consistent disparity map by aggregating those costs with dynamic programming along multiple image paths. Deep learning approaches learn matching functions that handle textureless surfaces, repetitive patterns, and occluded regions more robustly than classical methods: GC-Net and PSMNet apply 3D convolutions over a cost volume, while RAFT-Stereo iteratively refines a disparity estimate with recurrent updates that look up values in a correlation volume. RAFT-Stereo achieves sub-pixel disparity accuracy on standard benchmarks, producing depth maps detailed enough for precise 3D measurement.
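The path-wise dynamic programming in SGM can be illustrated on a single scanline. The sketch below aggregates costs along one left-to-right path only; a full SGM implementation sums several such paths over the whole image. The penalty values p1 and p2 and the toy cost slice are assumptions chosen for illustration.

```python
import numpy as np

def aggregate_path(cost, p1=1.0, p2=8.0):
    """Aggregate a (width, num_disparities) cost slice along one left-to-right path."""
    width, _ = cost.shape
    agg = np.empty_like(cost, dtype=float)
    agg[0] = cost[0]
    for x in range(1, width):
        prev = agg[x - 1]
        prev_min = prev.min()
        # Neighbor's cost at disparity d-1 and d+1, each with the small penalty p1.
        minus = np.concatenate(([np.inf], prev[:-1])) + p1
        plus = np.concatenate((prev[1:], [np.inf])) + p1
        # Best of: same disparity, +/-1 step (p1), or any jump (p2).
        best = np.minimum(np.minimum(prev, prev_min + p2), np.minimum(minus, plus))
        # Subtracting prev_min keeps the values from growing along the path.
        agg[x] = cost[x] + best - prev_min
    return agg

# Toy scanline: the true disparity is 2 everywhere, but pixel 3 is textureless,
# so every disparity has the same raw matching cost there.
cost = np.full((5, 4), 5.0)
cost[:, 2] = 0.0
cost[3, :] = 3.0

raw_winner = cost.argmin(axis=1)
smoothed_winner = aggregate_path(cost).argmin(axis=1)
```

At the ambiguous pixel the raw costs give no reason to prefer any disparity, while the aggregated costs inherit the neighbor's choice, which is the smoothness behavior SGM is designed to produce.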

Monocular depth estimation predicts depth from a single image, a problem that is ill-posed without additional assumptions because infinitely many 3D scenes can produce the same 2D image. Deep networks learn to exploit monocular depth cues that humans also use: perspective convergence (parallel lines meet at a horizon), relative size (familiar objects appear smaller when farther away), texture gradient (surface textures become finer with distance), atmospheric perspective (distant objects appear hazier), and occlusion (closer objects block farther ones). MiDaS, DPT, and Depth Anything produce dense relative depth maps from single images that are qualitatively accurate but lack the metric precision of stereo or lidar measurements. ZoeDepth and Metric3D combine relative depth estimation with metric scale recovery, producing absolute depth measurements from single images.
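One way to see the gap between relative and metric depth: if a few true distances are known (hypothetically from a handful of lidar returns), a least-squares scale and shift can map relative predictions onto meters. This is an illustrative alignment on synthetic numbers, not how ZoeDepth or Metric3D recover scale, and some models predict relative inverse depth, in which case the fit would be done in inverse-depth space.

```python
import numpy as np

def fit_scale_shift(relative, metric):
    """Least-squares scale s and shift b so that s * relative + b approximates metric."""
    A = np.stack([relative, np.ones_like(relative)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, metric, rcond=None)
    return s, b

# Synthetic example: relative predictions at four pixels and their (assumed) true depths.
relative = np.array([0.1, 0.4, 0.7, 1.0])
metric = 3.0 * relative + 0.5

s, b = fit_scale_shift(relative, metric)
aligned = s * relative + b
```

Without such reference measurements, or a model trained to predict metric scale, the relative map only orders points from near to far.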

Structure from Motion and Multi-View Reconstruction

Structure from Motion (SfM) reconstructs 3D scene geometry from a collection of photographs taken from different viewpoints. Given a set of images of a building, statue, or landscape taken from various positions, SfM simultaneously estimates the 3D positions of distinctive scene points and the camera positions and orientations from which each photograph was taken. The algorithm works by detecting feature points (like SIFT or SuperPoint keypoints) in each image, matching features between image pairs, computing the relative camera poses from these matches using epipolar geometry, and triangulating the matched feature points into 3D coordinates.
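The epipolar geometry step rests on one constraint: for a correct match, the normalized image points x1 and x2 satisfy x2^T E x1 = 0, where E is the essential matrix built from the relative rotation and translation. The sketch below constructs E from an assumed known pose and checks the constraint. A real SfM pipeline works the other way around, estimating E from many noisy matches.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ w equals np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def essential_matrix(R, t):
    """E = [t]_x R for a second camera whose coordinates are X2 = R @ X1 + t."""
    return skew(t) @ R

def normalized(X):
    """Normalized image coordinates (x/z, y/z, 1) of a point in camera coordinates."""
    return X / X[2]

# Assumed relative pose: 10 degree rotation about the y axis plus a small translation.
theta = np.radians(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([-0.3, 0.0, 0.05])
E = essential_matrix(R, t)

# Two scene points expressed in the first camera's coordinates.
X_a = np.array([0.5, 0.2, 4.0])
X_b = np.array([-1.0, 0.8, 6.0])

x1_a = normalized(X_a)
x2_a = normalized(R @ X_a + t)
x2_b = normalized(R @ X_b + t)

residual_match = x2_a @ E @ x1_a      # correct correspondence: essentially zero
residual_mismatch = x2_b @ E @ x1_a   # wrong correspondence: clearly nonzero
```

This residual is what lets robust estimators separate good feature matches from outliers before triangulation.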

COLMAP is the standard open-source SfM pipeline, handling everything from feature extraction through bundle adjustment (the global optimization that simultaneously refines all camera poses and 3D point positions to minimize reprojection error). Given a few hundred photographs of a building, COLMAP produces a sparse 3D point cloud containing thousands to millions of 3D points, along with the precise position and orientation of every camera. This sparse reconstruction serves as input for dense reconstruction methods that fill in the gaps. Multi-View Stereo (MVS) algorithms like COLMAP-MVS and OpenMVS compute dense depth maps for each camera view, then fuse them into a complete 3D mesh or point cloud.
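The reprojection error that bundle adjustment minimizes can be written compactly. The notation here is an assumption introduced for illustration: camera i has intrinsics K_i, rotation R_i, and translation t_i; X_j is the j-th 3D point; x_ij is its observed pixel location in image i; and V is the set of (camera, point) pairs where the point was actually observed.

```latex
\min_{\{R_i, t_i\},\, \{X_j\}} \;
\sum_{(i,j) \in \mathcal{V}}
\left\| x_{ij} - \pi\!\left( K_i \left( R_i X_j + t_i \right) \right) \right\|^2,
\qquad
\pi\!\left( [u, v, w]^\top \right) = \left[ u/w,\; v/w \right]^\top
```

Each term compares where a point was detected with where the current camera and point estimates say it should appear. Practical solvers typically wrap the squared norm in a robust loss to limit the influence of bad matches.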

SLAM (Simultaneous Localization and Mapping) performs SfM in real time, building a map of the environment while simultaneously tracking the camera's position within that map. Visual SLAM systems like ORB-SLAM3 process video frames as they arrive, matching features between consecutive frames to estimate motion and triangulate new 3D points. This enables augmented reality (placing virtual objects in the physical world requires knowing where the camera is and what the environment looks like), robot navigation (building a map while exploring), and real-time 3D scanning. Apple's ARKit and Google's ARCore use visual-inertial SLAM, combining camera tracking with accelerometer and gyroscope data for robust real-time 3D understanding on smartphones.

Neural 3D Representations

Neural Radiance Fields (NeRF, 2020) introduced a fundamentally new approach to 3D scene representation. Instead of reconstructing explicit geometry (meshes, point clouds), NeRF trains a neural network to represent the scene as a continuous function that maps any 3D coordinate and viewing direction to a color and density value. Given a collection of photographs with known camera positions, NeRF optimizes the network so that rendering the scene from each training camera position produces an image that matches the actual photograph. Once trained, the network can synthesize photorealistic views from any camera position, including viewpoints that were never photographed.
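Rendering from a radiance field comes down to compositing color and density samples along each camera ray. The sketch below shows that compositing step for one ray, with hand-written densities and colors standing in for the network's outputs. The function name and sample values are assumptions.

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Composite per-sample density and color along one ray into a pixel color."""
    # Opacity contributed by each sample over its segment length.
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light that reaches each sample unabsorbed.
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = transmittance * alphas
    pixel = (weights[:, None] * colors).sum(axis=0)
    return pixel, weights

# Four samples along a ray: empty space, then a dense red surface, then blue behind it.
densities = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
deltas = np.full(4, 0.5)

pixel, weights = render_ray(densities, colors, deltas)
```

The empty samples contribute nothing and the dense red sample absorbs almost all the light, so the blue sample behind it is effectively hidden. Because every step is differentiable, the same computation can be used to train the network against real photographs.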

The quality of NeRF's novel view synthesis was groundbreaking, but its speed was impractical: training a single scene took many hours to days, and rendering a single frame took from several seconds to minutes. Instant-NGP (2022) cut training time to seconds or minutes by pairing multiresolution hash-encoded feature grids with a much smaller neural network. 3D Gaussian Splatting (2023) took a different approach entirely, representing the scene as millions of 3D Gaussian primitives, each with a position, covariance (shape), color, and opacity. Rendering projects these Gaussians onto the image plane and composites them in depth order, enabling real-time rendering at 100+ frames per second while matching or exceeding NeRF's visual quality. Gaussian splatting's real-time capability has made neural 3D representation practical for interactive applications.
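The depth-ordered compositing used by Gaussian splatting can be sketched for a single pixel. This simplified version assumes the Gaussians have already been projected to 2D means and covariances, and it leaves out the 3D-to-2D covariance projection, view-dependent color, and the tile-based GPU rasterizer that make the real method fast. All names and values are illustrative.

```python
import numpy as np

def splat_pixel(pixel, means2d, covs2d, depths, colors, opacities):
    """Blend projected 2D Gaussians at one pixel, nearest first."""
    color = np.zeros(3)
    transmittance = 1.0
    for i in np.argsort(depths):  # front-to-back order
        d = pixel - means2d[i]
        # Gaussian falloff with distance from the splat center.
        falloff = np.exp(-0.5 * d @ np.linalg.inv(covs2d[i]) @ d)
        alpha = opacities[i] * falloff
        color += transmittance * alpha * colors[i]
        transmittance *= 1.0 - alpha
    return color

# Two splats centered on the same pixel: blue is farther away, red is nearer.
means2d = np.array([[10.0, 10.0], [10.0, 10.0]])
covs2d = np.stack([np.eye(2) * 4.0, np.eye(2) * 4.0])
depths = np.array([5.0, 2.0])
colors = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])

center = np.array([10.0, 10.0])
half_red = splat_pixel(center, means2d, covs2d, depths, colors, np.array([1.0, 0.5]))
opaque_red = splat_pixel(center, means2d, covs2d, depths, colors, np.array([1.0, 1.0]))
far_away = splat_pixel(np.array([100.0, 100.0]), means2d, covs2d, depths, colors,
                       np.array([1.0, 1.0]))
```

A half-transparent red splat lets half of the blue one show through, a fully opaque red splat hides it, and a pixel far from both centers receives essentially no color.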

These neural representations are transforming how 3D content is created. Photographers can capture a scene with a smartphone video and produce an explorable 3D representation within minutes. Real estate virtual tours, cultural heritage preservation, movie visual effects previsualization, and game environment scanning all benefit from the ability to turn casual photographs into photorealistic 3D scenes. The technology is converging with generative AI: models like Zero-1-to-3, DreamFusion, and LGM generate 3D objects from single images or text descriptions, combining 2D generative models with 3D representations.

Point Cloud Processing and Lidar

Lidar (Light Detection and Ranging) sensors directly measure 3D distances by emitting laser pulses and measuring their return time. A spinning lidar sensor produces a point cloud: a collection of 3D coordinates, typically 100,000 to 300,000 points per scan for automotive lidar. Unlike camera-derived depth, lidar provides precise, absolute distance measurements that are largely independent of ambient lighting and surface texture, although highly reflective or transparent surfaces and heavy rain or fog can still degrade returns. Automotive lidar achieves centimeter-level accuracy at ranges up to 200 meters. Aerial lidar, mounted on aircraft or drones, maps terrain beneath forest canopy by detecting ground-reflected pulses that pass through gaps in the foliage.
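The time-of-flight measurement behind each lidar point is simple arithmetic: the pulse travels to the surface and back, so range is half the round-trip time multiplied by the speed of light. The sketch below also converts a range and beam direction into a 3D point. The axis convention (x forward, y left, z up) is an assumption and varies between sensors.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0

def range_from_return_time(round_trip_s):
    """Distance to the reflecting surface from the pulse's round-trip time."""
    return SPEED_OF_LIGHT_M_S * round_trip_s / 2.0

def to_cartesian(range_m, azimuth_rad, elevation_rad):
    """Convert a range and beam direction to (x, y, z) in the sensor frame."""
    horizontal = range_m * math.cos(elevation_rad)
    return (horizontal * math.cos(azimuth_rad),
            horizontal * math.sin(azimuth_rad),
            range_m * math.sin(elevation_rad))

# A return after roughly 1.334 microseconds corresponds to a target about 200 m away.
r = range_from_return_time(1.334e-6)

# Beam pointing 90 degrees to the left, level with the sensor.
point = to_cartesian(r, math.radians(90.0), 0.0)
```

Light covers about 30 cm per nanosecond, so centimeter-level range accuracy requires timing resolution well below a nanosecond.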

Processing point clouds with neural networks requires architectures that handle unordered, irregularly spaced 3D points. PointNet (2017) was one of the first deep learning architectures to process raw point clouds directly, using per-point feature extraction followed by a global pooling operation that is invariant to point ordering. PointNet++ (2017) added hierarchical local feature aggregation, improving performance on fine-grained recognition tasks. Point Transformer applied self-attention to 3D point clouds, enabling each point to attend to its neighbors. These architectures power 3D object detection, semantic segmentation, and classification on lidar data for autonomous driving, where identifying every nearby vehicle, pedestrian, and obstacle in 3D is safety-critical.
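The order-invariance idea can be demonstrated without any training. The sketch below applies the same small per-point network to every point and then max-pools across points. It uses random, untrained weights and hypothetical layer sizes, so it shows only the structural property, not PointNet's actual architecture or accuracy.

```python
import numpy as np

def global_feature(points, w1, w2):
    """Shared per-point layers followed by a symmetric max pool over all points."""
    hidden = np.maximum(points @ w1, 0.0)  # same weights applied to every point (ReLU)
    hidden = np.maximum(hidden @ w2, 0.0)
    return hidden.max(axis=0)              # max over points ignores their order

rng = np.random.default_rng(0)
w1 = rng.normal(size=(3, 16))    # untrained, illustrative weights
w2 = rng.normal(size=(16, 32))

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(len(cloud))]

feat_original = global_feature(cloud, w1, w2)
feat_shuffled = global_feature(shuffled, w1, w2)
```

Because the maximum of a set does not depend on the order of its elements, shuffling the points leaves the global feature unchanged.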

Sensor fusion combines lidar point clouds with camera images to get the best of both modalities. Lidar provides precise 3D geometry but sparse sampling and no color or texture information. Cameras provide dense color images and rich texture but imprecise depth. Fusion projects lidar points onto camera images (using calibrated camera-lidar geometry) to create dense, colored 3D representations, or lifts camera features into 3D space using lidar depth as an anchor. BEVFusion and TransFusion are prominent camera-lidar fusion architectures for autonomous driving, reporting 3D detection results that exceed either modality alone.
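The projection step in camera-lidar fusion is the pinhole model applied to lidar points after moving them into the camera frame. In this sketch the extrinsic transform, the intrinsics, and the function name are all assumptions. The identity extrinsic used in the example pretends the lidar frame already coincides with the camera frame (z forward), which is not true of real sensor rigs.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_from_lidar, K, image_size):
    """Return pixel coordinates and depths of lidar points that fall inside the image."""
    ones = np.ones((len(points_lidar), 1))
    # Move points from the lidar frame into the camera frame (4x4 rigid transform).
    pts_cam = (T_cam_from_lidar @ np.hstack([points_lidar, ones]).T).T[:, :3]
    # Discard points behind the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0]
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    width, height = image_size
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < width) &
              (uv[:, 1] >= 0) & (uv[:, 1] < height))
    return uv[inside], pts_cam[inside, 2]

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)  # assumed: lidar and camera frames coincide

points = np.array([[0.0, 0.0, 10.0],    # straight ahead: lands at the principal point
                   [0.0, 0.0, -5.0],    # behind the camera: dropped
                   [100.0, 0.0, 1.0]])  # far off to the side: outside the image

uv, depths = project_lidar_to_image(points, T, K, image_size=(640, 480))
```

Each surviving pixel now carries a measured depth, which can be paired with the image color at that location or used to anchor camera features in 3D.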

Applications of 3D Vision

Augmented reality depends on 3D vision for every aspect of its operation. Placing virtual furniture in a room requires understanding the room's 3D layout: the location of the floor, the walls, and any existing furniture. Virtual objects must appear to rest on real surfaces, cast realistic shadows, and be occluded by real objects that are closer to the camera. Apple's LiDAR-equipped iPhones and iPads use depth sensing for instant room scanning, enabling AR applications that were previously limited to headsets. Meta's Quest headsets use multiple cameras and depth sensors for inside-out tracking and mixed reality experiences.

Autonomous robotics relies on 3D vision for navigation, manipulation, and collision avoidance. A warehouse robot needs a real-time 3D map of its surroundings to plan paths between shelving units. A robotic arm needs to estimate the 3D pose of objects on a conveyor belt to plan grasps. Surgical robots need millimeter-precise 3D vision of tissue surfaces. Each application has different requirements for range, precision, update rate, and computational budget, driving the diversity of 3D vision approaches from single-camera monocular estimation to multi-sensor fusion systems.

Digital twins create virtual replicas of physical spaces that can be explored, measured, and analyzed remotely. Architectural firms scan building sites with lidar and photogrammetry to create detailed 3D models before beginning design work. Construction companies compare as-built 3D scans against architectural plans to detect deviations early. Facility managers maintain digital twins of factories and warehouses for space planning and maintenance scheduling. These applications require not just visual 3D reconstruction but dimensionally accurate models where distances and angles in the digital twin match the physical reality within centimeters.

Key Takeaway

3D computer vision recovers depth and spatial structure from 2D images through stereo matching, structure from motion, learned monocular cues, and neural representations like NeRF and Gaussian splatting, enabling AR, robotics, autonomous driving, and digital twin applications.