How to Preprocess Images for AI
Image preprocessing is not optional or cosmetic. Neural networks are mathematical functions that expect input tensors with specific dimensions, value ranges, and channel orderings. Feeding an image with the wrong dimensions causes a crash. Feeding an image with values in the wrong range causes the model to produce garbage outputs. Feeding images with inconsistent preprocessing during training causes the model to waste capacity learning to handle variation that should have been normalized away. Every production vision system, from medical imaging to autonomous driving, has a carefully engineered preprocessing pipeline that ensures consistent, correct input to the model.
Step 1: Resize and Crop to Model Input Size
Most vision models expect inputs of a fixed size. A ResNet-50 expects 224x224 pixels. YOLOv8 defaults to 640x640. Vision transformers commonly use 224x224 or 384x384. Your preprocessing pipeline must resize every image to match, regardless of its original dimensions. The challenge is that real-world images come in every possible size and aspect ratio: smartphone photos might be 4032x3024, satellite tiles might be 2048x2048, and medical scans might be 512x512.
The simplest approach is to resize the image to the target dimensions, ignoring aspect ratio. This is fast but distorts the image: a 4032x3024 photo resized to 224x224 gets horizontally squished. For some tasks this distortion has minimal impact on accuracy, but for tasks where shape matters (medical imaging, manufacturing inspection), it can cause real problems. The alternative is to resize while preserving aspect ratio and then either crop or pad the result. Resizing so the shorter edge matches the target dimension and then center-cropping keeps the central region undistorted and fills the whole input tensor with image data, but discards content along the edges of the longer dimension. Resizing so the longer edge matches and then padding the remaining space with zeros (black borders) or reflected pixels preserves everything but spends part of the input tensor on non-informative padding.
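The geometry of the two aspect-preserving strategies can be sketched as follows. This is a minimal illustration rather than a library API: the function names center_crop_plan, letterbox_plan, and pad_to_square are hypothetical, and the code only computes sizes, crop boxes, and padding (the resize itself would be done by whichever image library you use).

```python
import numpy as np

def center_crop_plan(width, height, target):
    """Scale so the shorter edge equals `target`, then center-crop a square.

    Returns the resized (width, height) and the crop box
    (left, top, right, bottom) in resized-image coordinates.
    """
    scale = target / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

def letterbox_plan(width, height, target):
    """Scale so the longer edge equals `target`, then pad to a square.

    Returns the resized (width, height) and the padding
    (left, top, right, bottom) needed to reach target x target.
    """
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_w, pad_h = target - new_w, target - new_h
    # Split the padding as evenly as possible between the two sides.
    return (new_w, new_h), (pad_w // 2, pad_h // 2,
                            pad_w - pad_w // 2, pad_h - pad_h // 2)

def pad_to_square(image, padding, value=0):
    """Apply (left, top, right, bottom) constant padding to an HxWxC array."""
    left, top, right, bottom = padding
    return np.pad(image, ((top, bottom), (left, right), (0, 0)),
                  constant_values=value)

# A 4032x3024 smartphone photo prepared for a 224x224 model:
print(center_crop_plan(4032, 3024, 224))  # ((299, 224), (37, 0, 261, 224))
print(letterbox_plan(4032, 3024, 224))    # ((224, 168), (0, 28, 0, 28))
```

For this photo, center-cropping drops 75 of the 299 resized columns, while letterboxing spends 56 of the 224 rows on padding.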
Interpolation method matters when resizing. Nearest-neighbor interpolation copies the nearest pixel value, producing blocky results when upscaling but preserving sharp edges. Bilinear interpolation computes a weighted average of the 4 nearest pixels; it produces smooth results and is the default in most vision frameworks. Bicubic interpolation uses 16 neighboring pixels for higher quality, particularly noticeable when upscaling. Lanczos interpolation uses an even larger neighborhood and produces the sharpest results but is slower. For neural network preprocessing, bilinear interpolation is the standard choice because it balances quality and speed. Whatever method was used during training should also be used at inference time, because a model trained on bilinear-resized inputs has adapted to the pixel statistics that interpolation produces.
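To make the difference between nearest-neighbor and bilinear interpolation concrete, here is a small sketch that samples a single-channel image at a fractional coordinate. The helper names are hypothetical, and the code assumes coordinates that lie inside the image; real resize routines add details such as pixel-center alignment and antialiasing that are omitted here.

```python
import numpy as np

def nearest_sample(image, y, x):
    """Return the value of the pixel closest to the fractional (y, x)."""
    return image[int(y + 0.5), int(x + 0.5)]

def bilinear_sample(image, y, x):
    """Weighted average of the 4 pixels surrounding the fractional (y, x)."""
    h, w = image.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    # Clamp the far neighbors so sampling on the last row/column stays valid.
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

image = np.array([[0.0, 10.0],
                  [20.0, 30.0]])
print(nearest_sample(image, 0.25, 0.75))   # 10.0
print(bilinear_sample(image, 0.25, 0.75))  # 12.5
```

Nearest-neighbor snaps to an existing pixel value, while bilinear blends all four neighbors in proportion to their distance from the sample point.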
For detection and segmentation tasks where objects may appear at any scale, multi-scale preprocessing is common. The image is resized to several different resolutions (say 320x320, 640x640, and 1280x1280), and the model processes each scale separately. Detections from all scales are then merged through non-maximum suppression. This handles the scale variation problem at the cost of processing each image multiple times.
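The merging step can be sketched like this, assuming each scale yields boxes as (x1, y1, x2, y2) arrays in the coordinates of the resized image, and that each scale factor is resized size divided by original size. The function names and the 0.5 IoU threshold are illustrative assumptions; this is a plain greedy non-maximum suppression, not any particular detector's implementation.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

def merge_scales(per_scale_boxes, per_scale_scores, scales, iou_threshold=0.5):
    """Map boxes from every scale back to original coordinates, then run NMS."""
    boxes = np.concatenate([b / s for b, s in zip(per_scale_boxes, scales)])
    scores = np.concatenate(per_scale_scores)
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep]
```

Dividing by the scale factor before suppression matters: boxes from different resolutions are only comparable once they share the original image's coordinate system.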
Step 2: Normalize Pixel Values
Raw image pixels are stored as 8-bit unsigned integers with values from 0 to 255 (or 16-bit integers in medical and scientific imaging, with values from 0 to 65535). Neural networks work with 32-bit or 16-bit floating-point numbers and perform best when input values are centered near zero with moderate variance. Normalization converts pixel values from their storage format into the numerical range the model expects.
The most common normalization scales pixel values to the range [0, 1] by dividing by 255. This simple scaling works well for many architectures and is common in TensorFlow and Keras pipelines. An alternative scales to [-1, 1] by dividing by 127.5 and subtracting 1, which centers the values around zero. Models trained with one normalization scheme must use the same scheme at inference time; using the wrong normalization is one of the most common deployment bugs in computer vision.
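Both scalings are one-liners in NumPy. The sketch below assumes 8-bit input; the function names are hypothetical.

```python
import numpy as np

def scale_unit(image_u8):
    """Map uint8 pixels from [0, 255] to float32 values in [0, 1]."""
    return image_u8.astype(np.float32) / 255.0

def scale_symmetric(image_u8):
    """Map uint8 pixels from [0, 255] to float32 values in [-1, 1]."""
    return image_u8.astype(np.float32) / 127.5 - 1.0

pixels = np.array([0, 255], dtype=np.uint8)
print(scale_unit(pixels))       # [0. 1.]
print(scale_symmetric(pixels))  # [-1.  1.]
```

Converting to float before dividing is the important detail: arithmetic performed directly on uint8 arrays can wrap around or truncate.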
ImageNet normalization is a third approach used by most pre-trained models in the PyTorch ecosystem. After scaling to [0, 1], each color channel is standardized by subtracting its ImageNet training set mean and dividing by its standard deviation. The specific values are: red channel mean 0.485, standard deviation 0.229; green channel mean 0.456, standard deviation 0.224; blue channel mean 0.406, standard deviation 0.225. These values were computed from the 1.2 million images in the ImageNet training set and have become a de facto standard. When using any model pre-trained on ImageNet (which includes most publicly available vision models), applying these exact normalization values is critical for achieving the reported accuracy.
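As a minimal sketch, ImageNet standardization of an HxWx3 uint8 RGB array looks like this in NumPy. The function name is hypothetical, and the final transpose to channels-first layout reflects the layout PyTorch models expect; drop it for frameworks that use channels-last.

```python
import numpy as np

# Per-channel statistics in RGB order, as quoted in the text above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def imagenet_normalize(image_rgb_u8):
    """HxWx3 uint8 RGB image -> 3xHxW float32 standardized tensor."""
    x = image_rgb_u8.astype(np.float32) / 255.0   # scale to [0, 1] first
    x = (x - IMAGENET_MEAN) / IMAGENET_STD        # broadcasts over the last axis
    return np.transpose(x, (2, 0, 1))             # HWC -> CHW
```

Because the statistics are listed in RGB order, the input must already be RGB; applying them to a BGR array pairs the red statistics with the blue channel.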
For specialized domains like medical imaging, computing dataset-specific normalization statistics often improves performance. CT scans have a Hounsfield unit scale where values represent tissue density, and windowing (clipping to a specific range and rescaling) is used to emphasize the tissue types of interest: a lung window clips to [-1000, 200] HU while a bone window clips to [-200, 2000] HU. Using the appropriate window for each diagnostic task dramatically affects what the model can learn from the scan.
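Windowing itself is a clip followed by a rescale. The sketch below assumes the scan has already been converted to Hounsfield units and uses the lung window bounds quoted above; apply_window is a hypothetical helper name.

```python
import numpy as np

def apply_window(hu, lower, upper):
    """Clip Hounsfield units to [lower, upper] and rescale to [0, 1]."""
    clipped = np.clip(hu.astype(np.float32), lower, upper)
    return (clipped - lower) / (upper - lower)

scan = np.array([-2000, -1000, -400, 200, 3000])
print(apply_window(scan, -1000, 200))  # [0.  0.  0.5 1.  1. ]
```

Everything outside the window collapses to 0 or 1, so the model's full input range is spent on the tissue densities of interest.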
Step 3: Convert Color Spaces if Needed
Most neural network architectures expect RGB (Red, Green, Blue) input, the standard color representation for displays and web images. However, not all image sources provide RGB data. OpenCV, the most popular computer vision library, loads images in BGR (Blue, Green, Red) order by default, a historical legacy from early video hardware. Feeding BGR images to a model trained on RGB data silently swaps the red and blue channels, causing accuracy to degrade without any error message. This BGR/RGB swap is probably the most common preprocessing bug in computer vision, and every practitioner encounters it at least once.
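The fix is to reverse the channel axis. With OpenCV this is usually done with cv2.cvtColor(image, cv2.COLOR_BGR2RGB); the NumPy-only sketch below does the same thing and assumes an HxWx3 array. The function name is hypothetical.

```python
import numpy as np

def bgr_to_rgb(image_bgr):
    """Reverse the last (channel) axis of an HxWx3 array: BGR -> RGB."""
    # Slicing returns a view with negative strides; copy it into a
    # contiguous array, which some downstream libraries require.
    return np.ascontiguousarray(image_bgr[..., ::-1])

# One pure-blue pixel as OpenCV would load it (B=255, G=0, R=0).
blue_bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)
print(bgr_to_rgb(blue_bgr)[0, 0])  # [  0   0 255]
```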
Some applications benefit from alternative color spaces. HSV (Hue, Saturation, Value) separates color information from brightness, making it useful for tasks where lighting variation is a challenge. A red object has a consistent hue regardless of whether it is brightly or dimly lit, while its RGB values change substantially with illumination. LAB color space separates luminance from color in a perceptually uniform way, meaning that equal numerical distances in LAB space correspond to equal perceived color differences. This property makes LAB useful for color-based quality inspection where subtle color deviations must be detected.
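The hue claim is easy to check with Python's standard colorsys module. The two RGB triples below are made-up values for the same red surface under bright and dim light (the dim one is the bright one scaled by 0.4), and hsv_degrees is a hypothetical helper.

```python
import colorsys

def hsv_degrees(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation, value)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360, s, v

bright_red = (230, 20, 20)   # assumed: the object under strong light
dim_red = (92, 8, 8)         # assumed: the same object, 40% as bright

print(hsv_degrees(*bright_red))
print(hsv_degrees(*dim_red))
```

Both calls report a hue of 0 degrees and nearly identical saturation, while the value channel drops from about 0.90 to about 0.36. The RGB triples, by contrast, differ in every component.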
Grayscale conversion discards color information entirely, reducing each pixel from 3 channels to 1 channel. This is appropriate for tasks where color is irrelevant (document scanning, X-ray analysis, many forms of texture inspection) and cuts the size of the input data by 3x. The compute savings are smaller than that, because only the first layer of the network sees fewer input channels. The standard conversion formula weights the channels differently: gray = 0.299R + 0.587G + 0.114B, reflecting the human eye's greater sensitivity to green light. Some models, particularly those designed for medical imaging, are built with a single-channel input layer and expect grayscale data.
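The weighted sum maps directly onto a matrix product over the channel axis. A minimal sketch, assuming an HxWx3 RGB array (the function name is hypothetical):

```python
import numpy as np

# Weights from the formula above, in RGB order.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)

def rgb_to_gray(image_rgb):
    """HxWx3 RGB array -> HxW float32 grayscale array."""
    return image_rgb.astype(np.float32) @ LUMA_WEIGHTS
```

A pure green pixel (0, 255, 0) maps to about 149.7 and a pure blue pixel (0, 0, 255) to about 29.1, which shows how unevenly the channels contribute.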
Step 4: Reduce Noise and Correct Lighting
Images captured by sensors in challenging conditions often contain noise (random pixel-level variations) and uneven lighting that can confuse neural networks. While deep models are generally robust to moderate noise, severe noise from low-light cameras, high-ISO settings, or electronic interference degrades accuracy measurably. Gaussian blur with a small kernel (3x3 or 5x5) suppresses high-frequency noise at the cost of slight sharpness reduction. Bilateral filtering reduces noise while preserving edges by considering both spatial proximity and intensity similarity, producing cleaner results than Gaussian blur on textured surfaces.
Non-local means denoising, available in OpenCV as fastNlMeansDenoising, compares small patches across the image to find similar regions and averages them, producing high-quality denoising with strong texture preservation. This is computationally expensive but valuable for preprocessing datasets where image quality varies. For production inference pipelines where speed matters, simpler methods like Gaussian blur or the median filter (which is particularly effective against salt-and-pepper noise) are preferred.
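To show why the median filter handles salt-and-pepper noise so well, here is a 3x3 median filter for a single-channel image written in plain NumPy. It is a teaching sketch with a hypothetical function name; in practice a library routine would be used for speed.

```python
import numpy as np

def median_filter_3x3(image):
    """Replace each pixel with the median of its 3x3 neighborhood."""
    h, w = image.shape
    # Repeat edge pixels so border neighborhoods are also 3x3.
    padded = np.pad(image, 1, mode="edge")
    # Stack the 9 shifted views: one layer per neighborhood position.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return np.median(windows, axis=0).astype(image.dtype)

noisy = np.full((5, 5), 100, dtype=np.uint8)
noisy[2, 2] = 255  # a single "salt" pixel
print(median_filter_3x3(noisy)[2, 2])  # 100
```

The outlier is removed outright, because it never becomes the median of any neighborhood. A Gaussian blur would instead spread part of its value into the surrounding pixels.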
Histogram equalization redistributes pixel intensity values to use the full dynamic range, improving contrast in images that are too dark, too bright, or have compressed tonal ranges. Standard histogram equalization operates globally, which can over-amplify noise in some regions. CLAHE (Contrast Limited Adaptive Histogram Equalization) divides the image into tiles and equalizes each tile independently with a contrast limiting parameter, producing more natural-looking results. CLAHE is particularly valuable for medical imaging preprocessing, where subtle contrast differences between tissues carry diagnostic information that global equalization might distort.
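The global variant can be sketched in a few lines of NumPy by building a lookup table from the cumulative histogram. This is plain global equalization for 8-bit single-channel images, not CLAHE, which adds tiling, clipping, and interpolation between tiles (OpenCV provides it through cv2.createCLAHE). The function name is hypothetical.

```python
import numpy as np

def equalize_histogram(image_u8):
    """Global histogram equalization for a uint8 single-channel image."""
    hist = np.bincount(image_u8.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]   # CDF value at the darkest intensity present
    total = cdf[-1]
    if total == cdf_min:        # constant image: nothing to stretch
        return image_u8.copy()
    # Map each intensity so the output CDF is approximately linear.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = lut.clip(0, 255).astype(np.uint8)
    return lut[image_u8]

low_contrast = np.array([[100, 110],
                         [120, 130]], dtype=np.uint8)
print(equalize_histogram(low_contrast))
# [[  0  85]
#  [170 255]]
```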
White balance correction normalizes the color temperature of images captured under different lighting conditions. Photographs taken under fluorescent light have a green tint, incandescent light produces a warm orange cast, and daylight is relatively neutral. For datasets collected from diverse sources or at different times of day, white balance correction ensures that a white object appears white in all images, reducing a source of variation that the model would otherwise need to learn to handle. The simplest approach divides each channel by the mean value of its brightest pixels, so that the brightest region becomes white. Other methods estimate the illuminant with the gray-world assumption (that the average color of a scene is neutral gray) or with more elaborate color constancy algorithms.
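As one concrete example, the gray-world assumption leads to a very short correction: scale each channel so that all three channel means become equal. The sketch below assumes an HxWx3 uint8 RGB image, and the function name is hypothetical.

```python
import numpy as np

def gray_world_balance(image_rgb_u8):
    """Scale each channel so the per-channel means become equal."""
    x = image_rgb_u8.astype(np.float32)
    channel_means = x.reshape(-1, 3).mean(axis=0)
    # Gain per channel: overall mean divided by that channel's mean.
    gains = channel_means.mean() / np.maximum(channel_means, 1e-6)
    return np.clip(x * gains, 0, 255).round().astype(np.uint8)

# A uniformly warm-tinted patch: too much red relative to green and blue.
warm = np.empty((2, 2, 3), dtype=np.uint8)
warm[...] = (200, 100, 100)
print(gray_world_balance(warm)[0, 0])  # [133 133 133]
```

The assumption fails on scenes dominated by a single color (a field of grass, for example), which is one reason more elaborate illuminant estimators exist.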
Step 5: Build the Data Loading Pipeline
Preprocessing must be integrated into an efficient data loading pipeline that feeds the model during training and inference. The goal is to ensure the GPU is never waiting for data, because GPU compute is expensive and data starvation wastes it. PyTorch's DataLoader and TensorFlow's tf.data.Dataset both provide parallel data loading that reads and preprocesses images on CPU cores while the GPU processes the current batch. A common starting point is 4 to 8 worker processes; the right number depends on the CPU core count and on how expensive the preprocessing is, so increase it until GPU utilization stops improving.
On-the-fly preprocessing (applying transformations as images are loaded rather than saving preprocessed copies) is the standard approach for training because it enables data augmentation, where random transformations are applied differently each epoch so the model sees slightly different versions of each training image. Libraries like Albumentations, torchvision.transforms, and imgaug provide composable transformation pipelines that chain resize, normalize, augment, and convert operations into a single efficient function applied to each image as it is loaded.
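The composition idea behind those libraries can be illustrated without any of them. The sketch below is a deliberately tiny NumPy stand-in: Compose, random_horizontal_flip, to_float, and random_brightness are hypothetical names modeled loosely on that style, not the API of Albumentations, torchvision, or imgaug.

```python
import numpy as np

class Compose:
    """Chain steps; each step takes (image, rng) and returns an image."""
    def __init__(self, steps):
        self.steps = steps

    def __call__(self, image, rng):
        for step in self.steps:
            image = step(image, rng)
        return image

def random_horizontal_flip(p=0.5):
    def step(image, rng):
        # Mirror the width axis with probability p.
        return image[:, ::-1] if rng.random() < p else image
    return step

def to_float(image, rng):
    # Deterministic step: uint8 [0, 255] -> float32 [0, 1].
    return image.astype(np.float32) / 255.0

def random_brightness(max_delta=0.1):
    def step(image, rng):
        delta = np.float32(rng.uniform(-max_delta, max_delta))
        return np.clip(image + delta, 0.0, 1.0)
    return step

# Training applies random augmentation; evaluation keeps only the
# deterministic steps so results are reproducible.
train_transform = Compose([random_horizontal_flip(), to_float, random_brightness()])
eval_transform = Compose([to_float])

rng = np.random.default_rng(0)
image = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)
augmented = train_transform(image, rng)   # varies from call to call
clean = eval_transform(image, rng)        # always the same
```

Passing the random generator explicitly keeps each step testable and makes an epoch reproducible from a seed.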
For inference pipelines processing images at scale, preprocessing must match training exactly. Any difference in resize interpolation, normalization values, color channel ordering, or crop strategy between training and inference will degrade accuracy. The safest approach is to save the preprocessing configuration used during training and load it during inference, ensuring consistency. Some tools go further and ship the preprocessing with the model: torchvision's pre-trained weights expose the matching preprocessing through their transforms() method, and ONNX Runtime Extensions can embed pre- and post-processing steps in the model graph itself.
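Saving and reloading the configuration can be as simple as a JSON file written next to the model checkpoint. The sketch below uses only the standard library; the PreprocessConfig fields and the file layout are illustrative assumptions, not a standard format.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class PreprocessConfig:
    """Hypothetical record of every choice that must match at inference."""
    input_size: int = 224
    interpolation: str = "bilinear"
    channel_order: str = "RGB"
    crop: str = "center"
    mean: tuple = (0.485, 0.456, 0.406)
    std: tuple = (0.229, 0.224, 0.225)

def save_config(config, path):
    """Write the config as JSON, e.g. alongside the model weights."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(config), f, indent=2)

def load_config(path):
    """Read the config back; JSON stores tuples as lists, so convert them."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    raw["mean"] = tuple(raw["mean"])
    raw["std"] = tuple(raw["std"])
    return PreprocessConfig(**raw)
```

The inference code then builds its pipeline from load_config(...) rather than from hard-coded constants, so a change made during training cannot silently diverge from deployment.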
Dataset splitting divides available images into training (typically 70-80%), validation (10-15%), and test (10-15%) sets. Splits must be made before computing anything from the data as a whole, such as normalization statistics, which should be derived from the training set only. Otherwise information from the test set leaks into the training process. When images come from the same source (multiple frames from the same video, multiple crops from the same slide), all images from a single source must go into the same split. Otherwise, the model may appear to generalize well during validation while actually memorizing source-specific patterns.
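A source-aware split shuffles the sources rather than the individual images. The sketch below uses only the standard library; split_by_source, the 80/10/10 fractions, and the "video/frame" path convention in the usage example are assumptions made for illustration.

```python
import random
from collections import defaultdict

def split_by_source(items, source_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split items into (train, val, test) so no source spans two splits.

    `source_of` maps an item to its source identifier. The fractions apply
    to the number of sources, so image counts per split are approximate.
    """
    groups = defaultdict(list)
    for item in items:
        groups[source_of(item)].append(item)

    sources = sorted(groups)               # sort first for reproducibility
    random.Random(seed).shuffle(sources)

    n_train = int(len(sources) * fractions[0])
    n_val = int(len(sources) * fractions[1])
    parts = (sources[:n_train],
             sources[n_train:n_train + n_val],
             sources[n_train + n_val:])
    return tuple([item for s in part for item in groups[s]] for part in parts)

# Ten videos with five frames each; the source is the directory name.
frames = [f"video{v:02d}/frame_{i}.jpg" for v in range(10) for i in range(5)]
train, val, test = split_by_source(frames, lambda p: p.split("/")[0])
print(len(train), len(val), len(test))  # 40 5 5
```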
Proper image preprocessing means matching the exact resize method, normalization values, and color space that the model expects. Applying those steps consistently between training and inference matters more than which specific preprocessing approach you choose.