Computer Vision Video Analysis
What Makes Video Different from Images
A single video frame is just an image, and any image-based computer vision technique can be applied to individual frames. The power of video analysis comes from exploiting temporal relationships between frames. Motion reveals information that no single frame contains. A person raising their arm looks the same in a still photo whether they are waving hello, throwing a ball, or swatting a fly. Only the temporal sequence of positions distinguishes these actions. A stationary security camera might capture a scene where nothing appears unusual in any individual frame, but the temporal pattern of a person pacing, circling back, and trying door handles constitutes suspicious behavior that only emerges over time.
The computational cost of video analysis grows with the number of frames, so it is far larger than for a single image. A 30-second clip at 30 frames per second contains 900 frames. If each frame is 1920x1080 pixels with 3 color channels stored at one byte per channel, the raw data is roughly 5.6 gigabytes of uncompressed pixel values. Processing every frame through a deep neural network is expensive, and video datasets are correspondingly large. Kinetics-700, a standard action recognition benchmark, contains roughly 650,000 video clips across 700 action categories. Something-Something V2 contains 220,847 clips of human-object interactions. Training models on these datasets requires substantial GPU infrastructure, typically multiple machines with multiple GPUs each, and training runs that last days to weeks.
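The raw-size figure for the clip can be checked with a few lines of arithmetic. This is a minimal sketch that assumes 8-bit samples (one byte per channel) and no compression; the function name raw_video_bytes is made up for illustration.

```python
def raw_video_bytes(width, height, channels, fps, seconds, bytes_per_sample=1):
    """Return (frame_count, total_bytes) for uncompressed video.

    Assumes every color sample occupies `bytes_per_sample` bytes
    (1 for ordinary 8-bit color).
    """
    frame_count = fps * seconds
    bytes_per_frame = width * height * channels * bytes_per_sample
    return frame_count, frame_count * bytes_per_frame


frames, total = raw_video_bytes(1920, 1080, 3, fps=30, seconds=30)
print(f"{frames} frames, {total / 1e9:.2f} GB uncompressed")
```

Under these assumptions the clip comes to 5,598,720,000 bytes, which matches the rounded figure of about 5.6 gigabytes.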
Optical Flow
Optical flow is the pattern of apparent motion between consecutive video frames, represented as a vector field in which each pixel gets a displacement vector indicating where it appears to move. If a car drives from left to right across the frame, the pixels belonging to the car have rightward flow vectors. Pixels behind the car, where previously occluded background is newly revealed, have no counterpart in the earlier frame, so their flow is poorly defined. Optical flow captures all apparent motion in the scene at once, including camera motion, object motion, and brightness changes caused by lighting rather than by physical movement.
Traditional optical flow algorithms like Lucas-Kanade (1981) and Horn-Schunck (1981) treat flow estimation as an optimization problem built on brightness constancy: a pixel should keep the same intensity at its displaced location in the next frame. Lucas-Kanade assumes the flow is constant within a small window and solves a local least-squares problem, while Horn-Schunck adds a global smoothness constraint over the whole flow field. These methods work well for small, smooth motions but struggle with large displacements, occlusions, and texture-poor regions. FlowNet (2015) applied deep learning to optical flow, training a CNN to predict flow fields directly from frame pairs. RAFT (Recurrent All-Pairs Field Transforms, 2020) set a new state of the art at the time by iteratively refining flow predictions using a correlation volume that stores the similarity between features at all pairs of locations in the two frames.
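As a rough illustration of the classical formulation, the sketch below estimates a single flow vector for one window in the Lucas-Kanade style. It assumes brightness constancy, linearizes it as Ix*u + Iy*v + It = 0, and solves the resulting 2x2 least-squares system. The synthetic frames, the helper names, and the window bounds are assumptions made for this example, and the code handles only small sub-pixel motion, with no image pyramid or iterative refinement.

```python
import math


def make_frame(width, height, shift_x=0.0, shift_y=0.0):
    """Smooth synthetic image whose content is displaced by (shift_x, shift_y)."""
    return [
        [math.sin(0.2 * (x - shift_x)) + math.cos(0.25 * (y - shift_y))
         for x in range(width)]
        for y in range(height)
    ]


def lucas_kanade_window(frame1, frame2, x0, y0, x1, y1):
    """Estimate one (u, v) displacement for pixels in [x0, x1) x [y0, y1).

    The window must stay one pixel away from the image border because the
    spatial gradients use central differences.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(y0, y1):
        for x in range(x0, x1):
            ix = (frame1[y][x + 1] - frame1[y][x - 1]) / 2.0  # dI/dx
            iy = (frame1[y + 1][x] - frame1[y - 1][x]) / 2.0  # dI/dy
            it = frame2[y][x] - frame1[y][x]                  # dI/dt
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("window has too little texture (aperture problem)")
    # Solve [[sxx, sxy], [sxy, syy]] @ [u, v] = [-sxt, -syt] by Cramer's rule.
    u = (-sxt * syy + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


frame1 = make_frame(22, 22)
frame2 = make_frame(22, 22, shift_x=0.3, shift_y=0.2)  # content moves right and down
u, v = lucas_kanade_window(frame1, frame2, 1, 1, 21, 21)
print(f"estimated flow: u={u:.2f}, v={v:.2f}")
```

The estimate should land close to the true displacement of (0.3, 0.2) pixels. The determinant check is where texture-poor regions fail: when the gradients in a window all point the same way, or vanish, the system has no unique solution.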
Optical flow serves as an input representation for other video understanding tasks. Two-stream networks, introduced in 2014, process raw RGB frames through one CNN and pre-computed optical flow through a second CNN, then fuse their predictions for action recognition. The RGB stream captures appearance information (what objects are present) while the flow stream captures motion information (how things are moving). This separation is loosely motivated by biology: the human visual cortex is commonly described as having separate pathways for processing form (the ventral stream) and motion (the dorsal stream). On the action recognition benchmarks of the time, two-stream architectures outperformed comparable single-stream models, which suggests that explicitly representing motion is valuable for temporal understanding.
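One simple way to fuse the two streams is to average their class probabilities. The sketch below assumes each stream has already produced per-class logits for a clip. The label set, the logit values, and the flow_weight parameter are invented for illustration, and averaging is only one of several possible fusion schemes.

```python
import math

LABELS = ["opening a door", "closing a door", "waving"]  # hypothetical classes


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(rgb_logits, flow_logits, flow_weight=0.5):
    """Late fusion: weighted average of the two streams' class probabilities."""
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [(1.0 - flow_weight) * r + flow_weight * f
            for r, f in zip(rgb_probs, flow_probs)]


def top_label(scores):
    """Return the label with the highest score."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best_index]


# Appearance alone barely separates opening from closing a door;
# the motion stream is much more decisive about the direction.
rgb_logits = [2.0, 2.1, -1.0]
flow_logits = [3.0, 0.5, 0.0]
fused = fuse_two_streams(rgb_logits, flow_logits)
print("RGB only:", top_label(softmax(rgb_logits)))
print("fused:   ", top_label(fused))
```

With these made-up numbers the RGB stream alone narrowly prefers the wrong direction of motion, and the fused prediction follows the flow stream.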
Action Recognition
Action recognition classifies what activity is happening in a video clip. Given a 3 to 10 second clip, the model outputs a label: "playing basketball," "cooking," "brushing teeth," "handshaking," or one of hundreds of other categories. This is the temporal analog of image classification but substantially harder because the model must understand how visual patterns evolve over time, not just what they look like in a single instant. The difference between "opening a door" and "closing a door" is the direction of motion. The difference between "pouring water" and "drinking water" involves the spatial relationship between objects across multiple frames.
3D convolutional networks extend 2D CNNs to the temporal dimension by using 3D filters that process spatiotemporal volumes. C3D (2015) applied 3x3x3 convolutions to video clips, capturing local spatiotemporal patterns. I3D (Inflated 3D ConvNets, 2017) inflated the filters of a pre-trained 2D ImageNet model into 3D, effectively initializing the temporal dimension with spatial knowledge. This transfer from image pre-training to video was remarkably effective, achieving state-of-the-art action recognition results at the time while reusing knowledge learned from images instead of learning every filter from video alone. SlowFast Networks (2019) used two pathways operating at different temporal resolutions: a slow pathway processes frames at a low frame rate to capture spatial semantics, while a fast pathway processes frames at a high frame rate to capture rapid motion dynamics.
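Filter inflation can be illustrated on a single kernel. In the I3D recipe, a 2D filter is repeated along the time axis and divided by the temporal depth, so that on a video made of one repeated frame the 3D filter gives the same response as the original 2D filter. The sketch below uses plain nested lists rather than a deep learning framework; the kernel values, the patch, and the function names are assumptions chosen for the example.

```python
def inflate_kernel(kernel_2d, time_depth):
    """Copy a 2D kernel `time_depth` times along time, scaled by 1/time_depth."""
    return [
        [[w / time_depth for w in row] for row in kernel_2d]
        for _ in range(time_depth)
    ]


def response_2d(patch, kernel_2d):
    """Filter response at one position: elementwise product, then sum."""
    return sum(p * w
               for patch_row, kernel_row in zip(patch, kernel_2d)
               for p, w in zip(patch_row, kernel_row))


def response_3d(volume, kernel_3d):
    """Same operation over a stack of frames and a stack of kernel slices."""
    return sum(response_2d(frame, kernel_slice)
               for frame, kernel_slice in zip(volume, kernel_3d))


edge_kernel = [[-1.0, 0.0, 1.0],
               [-2.0, 0.0, 2.0],
               [-1.0, 0.0, 1.0]]       # stand-in for a pre-trained 2D filter
patch = [[0.1, 0.5, 0.9],
         [0.2, 0.4, 0.8],
         [0.0, 0.3, 0.7]]
static_clip = [patch, patch, patch]    # a "video" of one repeated frame

kernel_3d = inflate_kernel(edge_kernel, time_depth=3)
print(response_2d(patch, edge_kernel), response_3d(static_clip, kernel_3d))
```

The two printed responses agree up to floating-point rounding, which is the sense in which the inflated network starts out behaving like the image model and only afterwards learns temporal structure from video.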
Video transformers apply self-attention across both spatial and temporal dimensions. TimeSformer (2021) factorized attention into separate temporal and spatial components that are applied one after the other within each block: in temporal attention, each patch attends to the patches at the same spatial position in other frames, and in spatial attention, each patch attends to the other patches within its own frame. This factorization avoids the quadratic cost of joint attention over every patch in every frame while keeping the ability to capture long-range temporal dependencies. ViViT (Video Vision Transformer) and VideoMAE use related strategies, with VideoMAE achieving strong results through masked autoencoder pre-training, in which the model learns to reconstruct masked spatiotemporal patches.
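The saving from factorized attention can be estimated by counting query-key comparisons. The sketch below is back-of-the-envelope arithmetic, not a model: it ignores the classification token, attention heads, and constant factors, and the clip configuration (8 frames of 224x224 pixels cut into 16x16 patches) is an assumed example.

```python
def joint_attention_pairs(frames, patches_per_frame):
    """Every token attends to every token in the whole clip."""
    tokens = frames * patches_per_frame
    return tokens * tokens


def divided_attention_pairs(frames, patches_per_frame):
    """Each token attends to its own frame's patches (spatial) plus the
    patches at its own position in every frame (temporal)."""
    tokens = frames * patches_per_frame
    return tokens * (patches_per_frame + frames)


frames = 8
patches_per_frame = (224 // 16) ** 2   # 14 x 14 = 196 patches

joint = joint_attention_pairs(frames, patches_per_frame)
divided = divided_attention_pairs(frames, patches_per_frame)
print(f"joint: {joint:,}  divided: {divided:,}  ratio: {joint / divided:.1f}x")
```

For this configuration, joint attention needs 2,458,624 comparisons per layer against 319,872 for the factorized form, and the gap widens as clips get longer or resolution increases.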
Object Tracking
Object tracking follows specific objects across consecutive video frames, maintaining consistent identity assignments as objects move, change appearance, overlap with each other, and temporarily disappear behind occluding objects. Tracking is essential for autonomous driving (following the trajectory of each nearby vehicle and pedestrian), surveillance (tracking a person of interest through a camera network), sports analytics (following each player's movement), and video editing (applying effects to a specific moving object).
Single-object tracking (SOT) follows one specified target throughout a video, given its location in the first frame. The tracker must handle changes in the target's appearance due to rotation, deformation, scale change, and varying illumination. SiamFC (2016) introduced the Siamese network approach: a CNN encodes both the target template (from the first frame) and the search region (in the current frame) into feature maps, then cross-correlation finds where the template best matches within the search region. Subsequent Siamese trackers like SiamRPN, SiamMask, and SiamBAN added bounding box regression, mask prediction, and improved training strategies. Transformer-based trackers like TransT and MixFormer apply cross-attention between the template and search features, achieving more robust tracking through long videos.
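The matching step in a Siamese tracker can be sketched as a sliding dot product between the template and the search region. The example below works on small single-channel grids of made-up numbers; a real tracker such as SiamFC correlates multi-channel feature maps produced by a trained CNN, which this sketch does not attempt to reproduce.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` and return the map of dot products."""
    t_rows, t_cols = len(template), len(template[0])
    s_rows, s_cols = len(search), len(search[0])
    return [
        [sum(search[r + i][c + j] * template[i][j]
             for i in range(t_rows) for j in range(t_cols))
         for c in range(s_cols - t_cols + 1)]
        for r in range(s_rows - t_rows + 1)
    ]


def best_match(score_map):
    """Return the (row, col) of the highest score in the map."""
    positions = [(r, c)
                 for r in range(len(score_map))
                 for c in range(len(score_map[0]))]
    return max(positions, key=lambda rc: score_map[rc[0]][rc[1]])


template = [[1.0, 2.0],
            [3.0, 4.0]]                 # stand-in for first-frame target features

# Stand-in for current-frame search features: zeros, with the target
# pattern placed with its top-left corner at row 3, column 2.
search = [[0.0] * 6 for _ in range(6)]
for i in range(2):
    for j in range(2):
        search[3 + i][2 + j] = template[i][j]

scores = cross_correlate(search, template)
print("best match at (row, col):", best_match(scores))
```

The score map peaks where the search region contains the template pattern. On raw pixel values a plain dot product would also reward bright regions, which is one reason these trackers correlate learned features instead.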
Multi-object tracking (MOT) simultaneously tracks all objects of interest in a scene, maintaining unique identity assignments. The dominant paradigm is tracking-by-detection: run an object detector on each frame, then associate detections across frames using motion prediction and appearance matching. SORT (Simple Online and Realtime Tracking, 2016) used Kalman filters to predict each tracked object's position in the next frame and the Hungarian algorithm to match predictions to detections. DeepSORT added a deep appearance feature to improve identity matching when objects overlap or temporarily disappear. ByteTrack (2022) achieved state-of-the-art MOT performance by using both high-confidence and low-confidence detections, recovering tracked objects that the detector is uncertain about rather than losing them.
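The association step of tracking-by-detection can be sketched in a few functions. The example below makes two loudly simplified substitutions: a constant-velocity shift stands in for the Kalman filter prediction, and a brute-force search over assignments stands in for the Hungarian algorithm (it finds the same optimum but is practical only for a handful of objects). Box coordinates, velocities, the min_iou threshold, and all function names are assumptions for illustration.

```python
from itertools import permutations


def predict_box(box, velocity):
    """Shift an (x1, y1, x2, y2) box by a per-frame (dx, dy) velocity."""
    x1, y1, x2, y2 = box
    dx, dy = velocity
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def associate(predicted, detections, min_iou=0.3):
    """Return (track_index, detection_index) pairs maximizing total IoU.

    Assumes there are at least as many detections as tracks.
    """
    if len(detections) < len(predicted):
        raise ValueError("this sketch needs at least one detection per track")
    best_total, best_perm = -1.0, ()
    for perm in permutations(range(len(detections)), len(predicted)):
        total = sum(iou(predicted[t], detections[d]) for t, d in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    # Drop pairs that overlap too little to be the same object.
    return [(t, d) for t, d in enumerate(best_perm)
            if iou(predicted[t], detections[d]) >= min_iou]


tracks = [((0, 0, 10, 10), (5, 0)),        # moving right
          ((50, 50, 60, 60), (0, -5))]     # moving up
detections = [(49, 46, 59, 56), (6, 0, 16, 10), (200, 200, 210, 210)]

predicted = [predict_box(box, velocity) for box, velocity in tracks]
matches = associate(predicted, detections)
print(matches)
```

Track 0 is matched to detection 1 and track 1 to detection 0. The third detection stays unmatched, and in a full tracker it would be a candidate for starting a new track.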
Applications of Video Analysis
Surveillance and security is among the oldest and most widespread video analysis applications. Automated systems monitor thousands of camera feeds simultaneously, detecting events that require human attention: unauthorized entry, abandoned objects, crowd formation, fights, falls, and traffic violations. Anomaly detection systems learn the normal patterns of activity in a scene and flag deviations, enabling a single security operator to monitor hundreds of cameras by reviewing only flagged events. The technology raises significant privacy concerns, particularly when combined with facial recognition, but its ability to detect genuine security threats at scale is a major driver of its widespread deployment.
Sports analytics uses video analysis to track every player's position and movement throughout a game, measure speed and acceleration, classify actions (passes, shots, tackles), and generate tactical visualizations. The NBA's Second Spectrum system tracks player and ball positions at 25 frames per second using ceiling-mounted cameras, producing real-time analytics that coaches use for in-game decisions and post-game analysis. Soccer analytics systems track 22 players simultaneously, computing metrics like expected goals, pressing intensity, and passing network complexity. Broadcasting uses real-time tracking to overlay graphics showing player speeds, distances, and tactical formations.
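Once a tracker has produced a position for a player in every frame, speed follows from the distance covered between frames. The sketch below assumes the positions have already been converted from pixels to meters on the playing surface (a camera calibration step that is not shown) and that frames arrive at a fixed rate. The sample coordinates and the function name are invented.

```python
import math


def speeds_from_track(positions_m, fps):
    """Return the speed in m/s between each pair of consecutive positions."""
    frame_interval = 1.0 / fps
    return [
        math.hypot(x1 - x0, y1 - y0) / frame_interval
        for (x0, y0), (x1, y1) in zip(positions_m, positions_m[1:])
    ]


# Hypothetical player positions in meters, one per frame at 25 frames per second.
track = [(0.0, 0.0), (0.12, 0.16), (0.24, 0.32), (0.384, 0.512)]
speeds = speeds_from_track(track, fps=25)
print([round(s, 2) for s in speeds])
```

Frame-to-frame differences amplify tracking noise, so a production system would typically smooth positions over a short window before differentiating.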
Manufacturing uses video analysis for process monitoring that goes beyond static quality inspection. A camera watching an assembly line can detect when an operator skips a step, applies incorrect force, or uses the wrong tool. Ergonomic analysis systems track worker posture and repetitive motions to identify injury risks before they cause harm. Robot guidance systems use real-time video analysis to coordinate with human workers in shared spaces, ensuring safe collaboration. These temporal analysis capabilities provide insights that single-frame inspection cannot: the process matters, not just the final product.
Content understanding applies video analysis to entertainment, education, and media. Automatic video captioning generates text descriptions of what happens in a video. Video summarization condenses long videos into highlights by identifying the most important or interesting segments. Content moderation systems detect policy violations in user-uploaded video at the scale of platforms like YouTube and TikTok, which receive hundreds of hours of new video every minute. Accessibility tools generate audio descriptions of visual content for visually impaired viewers. These applications bridge the gap between visual content and text-based search, making the information in billions of hours of video searchable and accessible.
Video analysis adds temporal understanding to computer vision through optical flow, 3D convolutions, and video transformers, enabling action recognition, object tracking, and event detection that are impossible from single frames alone.