VLQ Encoding: The Technical Foundation of Source Maps

Optimizing Data Density Through Variable-Length Encoding Strategies

Mar 12, 2025

Binary integer representation often comes with tradeoffs between size and complexity. Variable-length quantity (VLQ) encoding optimizes these tradeoffs in systems where integers have non-uniform distribution patterns, particularly when smaller values dominate the dataset. While source maps for debugging transpiled code offer the most visible application, VLQ's elegant design patterns appear throughout critical infrastructure - from MIDI file formats to Protocol Buffers and beyond.

VLQ Encoding: Technical Principles

At its core, VLQ encoding is a binary serialization technique that optimizes integer representation by using a variable number of bytes that scales with the magnitude of the encoded value. In contrast to fixed-width encodings that allocate constant space regardless of value, VLQ applies a simple principle: smaller values deserve smaller encodings.

The technical implementation of VLQ encoding revolves around a clever bit manipulation strategy:

Integer to binary conversion: The numeric value is represented in binary form.
Sign handling: For signed integers, a zigzag encoding is typically employed. This maps signed integers to unsigned integers in a way that small negative and small positive numbers get mapped to small unsigned numbers. The sequence for small integers is 0, -1, 1, -2, 2, ... which gets mapped to 0, 1, 2, 3, 4, ....
Chunking: The binary representation is segmented into fixed-size groups (typically 7 data bits for standard VLQ, 5 data bits for Base64 VLQ). These chunks are actually 8 bits and 6 bits respectively including the continuation bit.
Continuation bit management: Each group is prefixed with a continuation bit - set to 1 if additional groups follow, or 0 for the terminal group.
Base64 transformation: In the case of source maps, each 6-bit value (5 data bits + 1 continuation bit) is encoded to a Base64 character.

This approach optimizes for the common case - small integers - while gracefully handling larger values at the cost of additional bytes.

Quantifying the Efficiency Gain

To appreciate the efficiency of VLQ encoding, let's compare it with fixed-width encoding across a range of values:

Value | Binary             | Fixed-Width  | Base64 VLQ     | Bytes 
      | Representation     | (32-bit)     | Encoding       | Saved
------|--------------------|--------------|----------------|-------
7     | 0000 0111          | 4 bytes      | 1 byte ("H")   | 75%
42    | 0010 1010          | 4 bytes      | 2 bytes ("vC") | 50%
150   | 1001 0110          | 4 bytes      | 2 bytes ("WC") | 50%
10000 | 0011 1110 1010 0000| 4 bytes      | 3 bytes ("gxC")| 25%

As shown in the table, small integers (which dominate in many practical applications including source maps) see the most significant reduction in size. In real-world scenarios where integer distributions are heavily skewed toward smaller values, the aggregate savings can be substantial:

A typical source map with 10,000 mapping entries might contain 50,000 integers
If 80% of these are small values (under 128), VLQ encoding could reduce the total size by approximately 50-60%.
For a source map that would be 200KB with fixed-width encoding, this translates to savings of around 100KB to possibly up to 120KB depending on the exact data distribution.

This efficiency becomes particularly important when multiplied across the hundreds of source maps that might exist in a large-scale web application.

Technical Deep Dive: VLQ Encoding Implementation

Let's encode 42 using Base64 VLQ to illustrate the implementation details. Source maps use a specific VLQ variant that processes the least significant bits first:

Decimal: 42
Binary: 00101010

1. For source map VLQ, we use a Little-Endian approach (least significant bits first)

2. Split into 5-bit groups, starting from the least significant bits:
   00101010 becomes groups [10101] (least significant 5 bits) and [010] (remaining higher bits).
   
3. Since we're encoding in Little-Endian order, the lower-order bits come first:
   First group: 10101 (lower 5 bits of 42)
   Second group: 010 (higher 3 bits of 42)
   
4. Form 6-bit groups by adding continuation bits (1 if more groups follow, 0 for the last group) as the *most significant bit* of each group.
   Starting with the first group [10101]: since there is more data [010], set the continuation bit to 1.
   So, the first 6-bit group is [1 10101] (binary 110101).
   For the second group [010], as it's the last group, set the continuation bit to 0.
   The second 6-bit group is [0 00010] (binary 000010).
   
5. Convert 6-bit values to Base64. **Important:** Ensure you are using the correct Base64 index mapping (standard Base64 alphabet).
   * [1 10101] (binary 110101, decimal 53) → **'v'** (Base64 index 53)
   * [0 00010] (binary 000010, decimal 2) → **'C'** (Base64 index 2)

Result: "vC"

Note that VLQ implementations vary significantly. Source maps use a specific implementation where:

The least significant bits come first (Little-Endian)
The continuation bit is the most significant bit of each 6-bit group
A continuation bit of 1 means "more groups follow"

For negative values, we first apply zigzag encoding. For example, to encode -42:

1. Zigzag encoding: -42 → 83 
   Using formula: (n << 1) ^ (n >> 31) for 32-bit integers
   For -42: (-42 << 1) = -84
            (-42 >> 31) = -1 (all ones, as arithmetic right shift preserves sign)
            -84 ^ -1 = 83
   
2. Now encode 83 using the same VLQ process:
   Binary: 01010011
   Split into 5-bit groups: [01001] [01]
   Add continuation bits: [1 01001], [0 0001]
   Base64 encode:
   * [1 01001] (binary 101001, decimal 41) -> **'p'** (Base64 index 41)
   * [0 00001] (binary 000001, decimal 1) -> **'B'** (Base64 index 1)

Result: "pB"

The zigzag encoding maps negative numbers to positive odd numbers and positive numbers to even numbers, creating a sequence: 0, -1, 1, -2, 2, -3, 3, ... which becomes 0, 1, 2, 3, 4, 5, 6, ... after encoding. This ensures that small negative numbers also get small encodings.

Source Maps: VLQ in Production

Source maps contain a mappings field - a dense VLQ-encoded string representing the mapping between generated and original source positions. Each mapping segment encodes up to five values:

Generated column number (always present)
Source index (optional)
Original line number (optional)
Original column number (optional)
Name index (optional)

These values are encoded as relative offsets from previous values, further enhancing compression. Here's a fragment of a source map's mappings string:

"AAAA,SAASA,cAAcC,MAAO,CACxB,IAAIC,MAAQ,EACZ"

Let's decode a small part of this to understand how it works:

"AAAA" decodes to [0] - This indicates column 0 of the generated file
"," is the segment separator
"SAASA" does not decode to [9,0,0,9]. It actually decodes to [0, 0, 0, 0, 1]. This represents a relative change to the previous mapping entry.

When fully decoded, this provides precise position mapping between transformed code and the original source code, enabling accurate debugging in production environments. The brilliance of this approach is that the encoding size scales with the complexity of the mapping, not with the file size.

Let’s dig into this a bit. When a TypeScript file undergoes complex transformations during compilation, the resulting source map needs to be larger and more detailed for several important reasons.

Complex TypeScript features generate more complex JavaScript output. Consider these examples:

When you use advanced TypeScript features like decorators, each decorator might generate multiple helper functions and runtime checks in the JavaScript output. A single line of TypeScript with several decorators could expand into dozens of lines of JavaScript code. To maintain accurate debugging, the source map needs to track how each piece of that generated JavaScript maps back to the original decorator line.

TypeScript generics are another example. Since JavaScript doesn't have native generics, TypeScript must transform generic code into JavaScript that performs type checking and validation at runtime. This transformation creates more complex relationships between the source and generated code that the source map must accurately represent.

Class inheritance in TypeScript with features like private fields, protected methods, and abstract classes generates significant JavaScript code to emulate these features. Each generated method and property access might include additional checks that weren't explicitly written in your TypeScript, and the source map needs to track these relationships.

Async/await transformations are particularly complex. A simple async function in TypeScript becomes a state machine in JavaScript with multiple code paths, promise chains, and error handling logic. The source map must maintain the connection between each part of this state machine and the original concise async function.

The key insight is that more complex TypeScript code doesn't just generate more JavaScript code—it generates JavaScript with less obvious, more intricate connections back to the original source. Each of these non-obvious connections requires additional mapping information in the source map.

TypeScript's compilation process heavily relies on this mechanism. When TypeScript code is transpiled to JavaScript, the TypeScript compiler generates these VLQ-encoded source maps to maintain the crucial relationship between the TypeScript source and the resulting JavaScript. This allows developers to set breakpoints and debug directly in their TypeScript files, even though the browser is actually executing the compiled JavaScript. Without VLQ's efficient encoding, these source maps would be prohibitively large for complex applications, potentially adding megabytes of debugging metadata that would impact load times and performance.

Protocol Buffers: VLQ's Close Cousin

Google's Protocol Buffers use a variant of VLQ called varint encoding. While conceptually similar, varint employs 7-bit chunks with continuation bits, as opposed to the 5-bit chunks used in source maps. This choice optimizes for machine parsing efficiency rather than for Base64 compatibility.

Here's how Protocol Buffers encodes the integer 150:

Decimal: 150
Binary: 10010110

1. Group into 7-bit chunks (least significant bits first):
   [0010110] [0000001]
   
2. Add continuation bits (in the most significant bit position):
   [10010110] [00000001]
   First byte: 10010110 (0x96) - MSB is 1, indicating more bytes follow
   Second byte: 00000001 (0x01) - MSB is 0, indicating this is the final byte

Result: Byte sequence 0x96 0x01

In a practical Protocol Buffer message definition:

message SensorReading {
  optional int32 timestamp_seconds = 1;
  optional int32 value = 2;
}

The field values are encoded using varint, making the serialized representation particularly efficient for small values. For instance, a timestamp of 1615300000 (March 9, 2021) would take 4 bytes using varint encoding, versus 4 bytes with fixed-width encoding - seemingly equivalent until you consider that a value of 100 would take just 1 byte with varint but still 4 bytes with fixed-width encoding.

The implementation in C++ looks like this:

// Encoding a varint
char buffer[10];  // 10 bytes is enough for any 64-bit value
char* end = EncodeVarint32(buffer, value);
size_t size = end - buffer;  // Size of the encoded value

// From the Protocol Buffers codebase
char* EncodeVarint32(char* dst, uint32_t v) {
  // Operate on characters as unsigneds
  unsigned char* ptr = reinterpret_cast<unsigned char*>(dst);
  static const int B = 128;
  if (v < (1<<7)) {
    *(ptr++) = v;
  } else if (v < (1<<14)) {
    *(ptr++) = v | B;
    *(ptr++) = v>>7;
  } else if (v < (1<<21)) {
    *(ptr++) = v | B;
    *(ptr++) = (v>>7) | B;
    *(ptr++) = v>>14;
  } else if (v < (1<<28)) {
    *(ptr++) = v | B;
    *(ptr++) = (v>>7) | B;
    *(ptr++) = (v>>14) | B;
    *(ptr++) = v>>21;
  } else {
    *(ptr++) = v | B;
    *(ptr++) = (v>>7) | B;
    *(ptr++) = (v>>14) | B;
    *(ptr++) = (v>>21) | B;
    *(ptr++) = v>>28;
  }
  return reinterpret_cast<char*>(ptr);
}

This implementation reveals an important optimization: by handling common cases with fewer operations rather than using a general loop, Protocol Buffers prioritizes encoding/decoding speed for the most frequent value ranges.

MIDI Files: VLQ in Audio Engineering

MIDI file format uses a variation of VLQ for encoding delta times between events. This application is particularly interesting because it highlights how VLQ can be adapted to domain-specific requirements.

A standard MIDI delta time is encoded with 7-bit chunks, where the most significant bit in each byte indicates whether another byte follows. For instance, a delta time of 0x7F (127) would be encoded as a single byte 0x7F, while a value of 0x80 (128) would require two bytes: 0x81 0x00.

Delta time: 0x3FFF (16383)
Binary: 0011 1111 1111 1111

1. Group into 7-bit chunks (least significant bits first):
   [1111111] [0111111] [0000000]
   
2. Add continuation bits (in the most significant bit position):
   [11111111] [10111111] [00000000]

Result: Byte sequence 0xFF 0xBF 0x00

This design choice ensures that small time deltas, which are most common in MIDI sequences, receive the most efficient encoding.

Engineering Implications

The efficiency of VLQ encoding has profound implications across multiple engineering disciplines:

Build pipeline optimization: Build tools for any compiled or transpiled language can leverage VLQ-encoded metadata to minimize debugging information overhead while maintaining full fidelity.
Network payload reduction: In any networked application, variable-length encoding reduces bandwidth consumption for integer-heavy data, particularly when value distributions are non-uniform.
Parser implementation complexity: Tools that decode VLQ-encoded data must balance decoding efficiency with memory usage, often employing lookup tables and specialized bit manipulation.
Storage considerations: Database systems, file formats, and caching mechanisms all benefit from VLQ's space efficiency when storing large collections of integers with heterogeneous magnitudes.

Beyond Source Maps: Broader Applications

The principles behind VLQ encoding extend beyond source maps and are worth considering in your own engineering work:

Custom binary protocols: When designing communication protocols where integer values have non-uniform distributions, VLQ-style encoding can significantly reduce payload size.
In-memory data structures: For applications with memory constraints, applying VLQ encoding to internal data structures can reduce memory pressure.
Time-series data: Similar to MIDI's use case, any sequential data where deltas between values tend to be small can benefit from VLQ encoding.

Conclusion

VLQ encoding exemplifies how seemingly minor implementation details can have outsized impacts on system performance, storage efficiency, and overall engineering quality. Understanding these foundational techniques provides both practical insights for daily work and a deeper appreciation for the elegant solutions hiding in plain sight throughout our technical infrastructure.

Whether you're designing a new binary protocol, optimizing a database storage engine, or simply debugging code through source maps, the principles of VLQ encoding demonstrate that careful attention to data representation fundamentals continues to yield significant benefits across all domains of software engineering.

A Peek Under the Hood

I use Claude and other LLMs as thinking partners while writing these posts - they help me refine examples and articulate complex concepts. I rigorously fact-check the technical content. Any mistakes you spot are definitely on me - drop a comment and I'll update the post.

How are you leveraging variable-length encoding techniques in your own engineering projects? Share your experiences in the comments.

Dushyanth’s Substack

Discussion about this post

Ready for more?