Data Compression Algorithms: How Files Get Smaller Without Losing Information
Information Theory Foundations
Claude Shannon's 1948 paper, "A Mathematical Theory of Communication," established the theoretical foundation for data compression. Shannon defined entropy as the minimum average number of bits needed to represent each symbol in a message, given the probability distribution of symbols. A message where all symbols are equally likely has maximum entropy and cannot be compressed. A message where one symbol dominates has low entropy and can be compressed significantly.
For example, a file containing only the letter "A" repeated a million times has very low entropy. Instead of storing one million copies of "A," you can store the instruction "repeat A one million times," a dramatic compression. Real data falls between these extremes. English text has an entropy of roughly 1.0 to 1.5 bits per character, meaning it can theoretically be compressed to about one-fifth of its ASCII representation. Shannon's theory gives us the theoretical limit; compression algorithms try to approach it.
Lossless Compression
Huffman Coding
Huffman coding assigns variable-length binary codes to symbols based on their frequency. Frequent symbols get short codes; rare symbols get long codes. The algorithm builds a binary tree bottom-up: start with each symbol as a leaf node weighted by its frequency, repeatedly merge the two lowest-weight nodes into a new parent, and continue until one tree remains. The path from root to each leaf, turning left for 0 and right for 1, defines the code for that symbol.
Huffman codes are prefix-free, meaning no code is a prefix of another, which allows unambiguous decoding. Huffman coding is provably optimal among prefix-free codes, meaning no other prefix-free code can produce a shorter average encoding for the given symbol distribution. It is used as a final stage in many compression formats, including DEFLATE, JPEG, and MP3.
LZ77 and Dictionary-Based Compression
The Lempel-Ziv 1977 (LZ77) algorithm exploits repeated patterns in data. Instead of encoding each character individually, LZ77 replaces repeated sequences with references to earlier occurrences. A reference consists of a (distance, length) pair: "go back 47 characters and copy 12 characters." This works well on structured data like source code, markup languages, and log files where patterns repeat frequently.
LZ78 and LZW (Lempel-Ziv-Welch) build an explicit dictionary of previously seen patterns and replace subsequent occurrences with dictionary indices. LZW was used in the GIF image format and in the Unix compress utility. These dictionary-based methods adapt to the data as they process it, without requiring a separate analysis pass to compute symbol frequencies.
DEFLATE
DEFLATE combines LZ77 dictionary matching with Huffman coding, using LZ77 to find repeated patterns and Huffman coding to compress the output optimally. Invented by Phil Katz for the PKZIP program in 1993, DEFLATE is the algorithm behind ZIP files, gzip compression, and the PNG image format. Its combination of good compression ratio, reasonable speed, and patent-free status made it the dominant general-purpose compression algorithm for decades.
Modern Lossless Algorithms
Zstandard (zstd), developed by Facebook in 2016, offers significantly better compression speed than DEFLATE at comparable ratios, or better ratios at comparable speed. It uses finite state entropy coding and a more sophisticated dictionary matching approach. Zstandard is increasingly adopted for database compression, package distribution, and real-time data streaming.
Brotli, developed by Google, achieves higher compression ratios than DEFLATE, making it ideal for web content. It uses a large predefined dictionary of common web content strings, combined with LZ77 and Huffman coding. Major web browsers support Brotli for HTTP compression, reducing page load times.
Lossy Compression
Image Compression: JPEG
JPEG compression exploits limitations in human visual perception. It converts the image from RGB to YCbCr color space (separating brightness from color), downsamples the color channels (human eyes are less sensitive to color detail than brightness detail), applies the Discrete Cosine Transform (DCT) to convert spatial data into frequency components, quantizes the frequency coefficients (discarding high-frequency detail that is hard to perceive), and encodes the result with Huffman coding.
The quantization step is where information is lost, and its aggressiveness is controlled by a quality parameter. At high quality, JPEG images are nearly indistinguishable from the original. At low quality, visible artifacts appear, particularly blocky patterns at the boundaries of 8x8 pixel blocks. JPEG is not suitable for text, line art, or images with sharp edges, where PNG (lossless) produces better results.
Audio Compression: MP3 and Beyond
MP3 compression uses a psychoacoustic model to identify sounds that human ears cannot perceive: sounds masked by louder nearby frequencies, sounds below the hearing threshold, and temporal masking effects where a loud sound makes softer sounds before and after it imperceptible. These inaudible components are discarded or represented with fewer bits. At 128 kbps, MP3 achieves roughly 11:1 compression with quality acceptable to most listeners. At 320 kbps, the compression is about 4.4:1 with quality nearly indistinguishable from the uncompressed original.
Newer codecs like AAC (used by Apple), Opus (used for voice and music streaming), and Vorbis (used in games) achieve better quality at lower bitrates through more sophisticated psychoacoustic models and entropy coding. Opus in particular is remarkable for its flexibility, handling both speech and music well across a wide range of bitrates.
Video Compression
Video compression applies spatial compression within each frame (similar to JPEG for still images) and temporal compression between frames. Keyframes (I-frames) are compressed independently. Predicted frames (P-frames) encode only the differences from the previous frame. Bidirectional frames (B-frames) encode differences from both previous and future frames. Since consecutive video frames are typically very similar, temporal prediction achieves enormous compression ratios.
Modern codecs like H.265/HEVC and AV1 achieve roughly 50% better compression than their predecessors at the same quality, enabling 4K and 8K streaming over consumer internet connections. The computational cost of these advanced codecs is high, requiring dedicated hardware encoders in devices and data centers.
Choosing the Right Compression
The choice between lossless and lossy compression depends on the data and the application. Text, source code, executables, and any data where every bit matters must use lossless compression. Photographs, music, and video can use lossy compression because human perception has inherent limitations that compression algorithms can exploit. Medical imaging, scientific data, and archival storage often require lossless compression even for images, because any information loss could affect diagnosis or analysis.
Compression algorithms make modern digital life possible by reducing storage and bandwidth requirements. Lossless compression preserves every bit of data through clever encoding, while lossy compression exploits the limits of human perception to achieve far greater reduction. The internet, streaming media, and digital storage all depend fundamentally on these algorithmic techniques.