← Back to Blog

Understanding the DEFLATE Algorithm: How Gzip Compression Works

10 min read

The DEFLATE algorithm is the foundation of gzip compression, combining two powerful techniques to achieve efficient data compression. Understanding how DEFLATE works helps developers optimize compression settings, troubleshoot compression issues, and appreciate the elegance of lossless data compression. This article explores the inner workings of DEFLATE, from its theoretical foundations to practical implementation details.

What is DEFLATE?

DEFLATE is a lossless data compression algorithm specified in RFC 1951. It combines LZ77 compression (dictionary-based) with Huffman coding (entropy encoding) to achieve high compression ratios. According to the Wikipedia DEFLATE article, it's one of the most widely used compression algorithms, powering gzip, PNG images, and ZIP files.

The algorithm was developed by Phil Katz (creator of ZIP) and later refined by Jean-Loup Gailly and Mark Adler for use in gzip. The GZIP file format (RFC 1952) uses DEFLATE as its compression method, wrapping compressed data with headers and checksums.

LZ77 Compression: Finding Patterns

The first stage of DEFLATE uses LZ77 compression, named after the 1977 paper by Abraham Lempel and Jacob Ziv. According to Wikipedia's LZ77 article, this algorithm works by:

  • Scanning through data to find repeated sequences
  • Replacing repeated sequences with references to earlier occurrences
  • Using a sliding window to limit memory usage
  • Creating a dictionary of previously seen patterns

For example, in the text "the cat sat on the mat", the phrase "the " appears twice. LZ77 would replace the second occurrence with a reference like (distance: 11, length: 4), meaning "go back 11 characters and copy 4 characters".

LZ77 Example:

Original: "the cat sat on the mat"

Compressed: "the cat sat on [11,4]mat"

[11,4] means: go back 11 chars, copy 4 chars

Huffman Coding: Optimal Encoding

After LZ77 compression, DEFLATE applies Huffman coding to further compress the data. Huffman coding is an entropy encoding algorithm that assigns shorter codes to more frequent symbols, reducing the average code length.

The algorithm works by:

  • Counting frequency of each symbol in the data
  • Building a binary tree based on symbol frequencies
  • Assigning shorter codes to frequent symbols
  • Using variable-length codes instead of fixed-length codes

For example, if 'e' appears 100 times and 'z' appears once, 'e' gets a short code like "01" while 'z' gets a longer code like "11110". This reduces the total number of bits needed to represent the data. The RFC 1951 specification defines how DEFLATE constructs and uses Huffman trees.

How DEFLATE Combines Both Techniques

DEFLATE uses a two-stage approach:

  1. LZ77 Stage: Finds and replaces repeated sequences with references
  2. Huffman Stage: Encodes the resulting data using variable-length codes

This combination is powerful because:

  • LZ77 creates patterns (repeated references) that Huffman coding can compress efficiently
  • Huffman coding optimizes the encoding of both literal characters and LZ77 references
  • The two-stage approach achieves better compression than either technique alone

DEFLATE Block Structure

According to RFC 1951, DEFLATE divides data into blocks:

  • Uncompressed blocks: Used when compression doesn't help (rare)
  • Fixed Huffman blocks: Use predefined Huffman trees (faster but less efficient)
  • Dynamic Huffman blocks: Build custom Huffman trees for optimal compression (slower but better)

Most DEFLATE implementations use dynamic Huffman blocks for optimal compression. The GNU Gzip manual explains how gzip chooses between block types based on compression level settings.

Compression Levels Explained

Gzip supports compression levels from 1 to 9, each representing different trade-offs:

  • Level 1: Fast compression, minimal CPU usage, ~60% compression ratio
  • Level 6: Balanced (default), good compression with reasonable speed, ~70% compression ratio
  • Level 9: Maximum compression, slower processing, ~75% compression ratio

Higher levels use more aggressive LZ77 matching and optimize Huffman trees more thoroughly. The RFC 1951 specification allows implementations flexibility in how they achieve compression, leading to different compression level behaviors.

Decompression Process

Decompression reverses the process:

  1. Read Huffman tree from compressed data
  2. Decode Huffman codes back to symbols and LZ77 references
  3. Expand LZ77 references by copying from earlier in the output
  4. Output the reconstructed original data

Decompression is typically faster than compression because it doesn't need to search for patterns or build Huffman trees. The Wikipedia DEFLATE article explains that decompression can be 2-3x faster than compression.

Why DEFLATE is Efficient

DEFLATE achieves excellent compression because:

  • Adaptive: Adjusts to data patterns, building optimal Huffman trees
  • Universal: Works well on many data types (text, code, structured data)
  • Fast: Efficient algorithms make compression and decompression quick
  • Proven: Decades of optimization and real-world use

The algorithm is particularly effective on text-based formats. According to web.dev's compression guide, DEFLATE achieves 60-90% compression on HTML, CSS, JavaScript, and JSON files.

Limitations and Alternatives

While DEFLATE is excellent, it has limitations:

  • Already compressed data: Doesn't compress images, videos, or encrypted data well
  • Small files: Overhead may exceed benefits for very small files
  • Random data: Cannot compress truly random data (no patterns to exploit)

Modern alternatives like Brotli offer better compression ratios (15-20% improvement) but use more CPU. Brotli builds on DEFLATE concepts but adds additional techniques like context modeling.

Practical Applications

DEFLATE powers many technologies:

  • Gzip: File compression and HTTP content encoding
  • PNG: Image format uses DEFLATE for lossless compression
  • ZIP: Archive format uses DEFLATE for file compression
  • HTTP compression: Web servers compress responses using DEFLATE/gzip
  • Git: Version control uses DEFLATE for object compression

Conclusion

The DEFLATE algorithm is a masterpiece of data compression, elegantly combining LZ77 pattern matching with Huffman entropy encoding. Understanding how DEFLATE works helps developers make informed decisions about compression settings, troubleshoot compression issues, and appreciate the sophisticated algorithms that power modern data compression. Whether you're optimizing web performance, compressing files, or working with compressed data formats, DEFLATE's principles are fundamental to efficient data handling.

See DEFLATE in Action

Try our compression tool to see how DEFLATE compresses your files

Try Compression Tool