Vector Quantization for LLM weights: a real 25% win at 3 bits, and why every cheaper number is a lie

An honest sweep of VQ vs scalar quantization, validated the only way that counts: stacked real-data perplexity.

📊 Companion agentic-LLM leaderboard, same "measure honestly" philosophy: rab.utopiaia.com (RAB, the Real-world Agentic Benchmark).

TL;DR

  • At ~3 bits/weight, vector quantization (VQ) keeps a model essentially intact (perplexity 1.11× baseline), where ordinary scalar quantization at the same bitrate is destroyed (5.36×). Scalar needs a full extra bit (4 b/w, 1.04×) to get back to that quality range, i.e. VQ is ~25% smaller than Q4 at near-equal quality.
  • Below 3 bits, naïve VQ collapses (2-bit VQ → 12× perplexity). The sub-1-bit and 2-bit dreams are not reachable without calibration plus error feedback, at which point you have simply reimplemented AQLM / QuIP#.
  • The headline methodological result: weight MSE and single-tensor output fidelity both lie. A 2-bit VQ config that looked near-lossless on per-tensor cosine (0.991) was 12× worse on real perplexity. Only stacked, real-data perplexity predicts model quality. If you are evaluating a quantizer on per-tensor error, you are measuring the wrong thing.

The idea

Scalar quantization rounds each weight independently to a low-bit grid (RTN), with a shared scale per block. Vector quantization instead groups each weight matrix into sub-vectors of length dim, clusters them into K shared centroids (a codebook), and stores one index per sub-vector. The cost is:

bits/weight = log2(K) / dim     (+ a small codebook + scale overhead)

So dim=4, K=4096 → 12/4 = 3 bits/weight; dim=4, K=256 → 8/4 = 2 bits; dim=16, K=4096 → 12/16 = 0.75 bits (≈0.77 with overhead). The promise: a codebook captures correlations between adjacent weights that scalar rounding throws away, so the same bit budget should buy more fidelity.
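To make the arithmetic concrete, here is a minimal sketch of the bits/weight formula. The function name, the fp16 codebook entries, and the one-codebook-per-4096×4096-tensor layout are illustrative assumptions, not the repository's accounting; scale overhead is not modeled.

```python
import math

def vq_bits_per_weight(K, dim, n_weights=None, codebook_bits=16):
    """Index cost log2(K)/dim, plus amortized codebook cost if n_weights is given.

    Assumption: one codebook of K centroids x dim entries per tensor,
    stored at codebook_bits per entry. Scale overhead is ignored.
    """
    index_bits = math.log2(K) / dim
    if n_weights is None:
        return index_bits
    return index_bits + K * dim * codebook_bits / n_weights

# The three configurations named in the text, index bits only.
for K, dim in [(4096, 4), (256, 4), (4096, 16)]:
    print(f"dim={dim:>2} K={K:>4}: {vq_bits_per_weight(K, dim):.2f} index bits/weight")

# Hypothetical 4096 x 4096 tensor carrying its own fp16 codebook.
with_codebook = vq_bits_per_weight(4096, 4, n_weights=4096 * 4096)
print(f"with codebook overhead: {with_codebook:.3f} bits/weight")
```

Under these assumed numbers the codebook adds under 0.02 b/w on a large tensor. Because the codebook size is fixed, the same overhead grows quickly on small tensors.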

That promise is real, but only in a narrow band, and only if you measure it correctly.

Measuring it correctly

We evaluated three ways, in increasing order of honesty:

  1. Weight MSE: reconstruction error of the matrix itself. Cheap and misleading, because it has no idea which errors the forward pass amplifies.
  2. Single-tensor output fidelity: feed random inputs x through W and W_recon, measure output cosine. Better in principle, but with Gaussian inputs it still lied to us (see below).
  3. Stacked real-data perplexity: quantize every layer, run the whole model on real text, measure perplexity. This is the only metric that tracks what a user would actually feel.
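For reference, the three metrics can be sketched in a few lines of plain Python. This is an illustration on toy matrices, not the repository's harness: the function names, the probe count, and the tiny 2×2 example are all assumptions. Perplexity is shown in its standard form, the exponential of the mean per-token negative log-likelihood; producing those log-probabilities requires running the full quantized model, which is exactly the expensive part.

```python
import math
import random

def weight_mse(W, W_recon):
    """Metric 1: mean squared reconstruction error over all matrix entries."""
    diffs = [(a - b) ** 2 for ra, rb in zip(W, W_recon) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def output_cosine(W, W_recon, n_probes=32, seed=0):
    """Metric 2: mean cosine between W @ x and W_recon @ x for Gaussian probes x."""
    rng = random.Random(seed)
    n_in = len(W[0])
    scores = []
    for _ in range(n_probes):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
        scores.append(cosine(matvec(W, x), matvec(W_recon, x)))
    return sum(scores) / len(scores)

def perplexity(token_logprobs):
    """Metric 3: exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy tensor and a slightly perturbed reconstruction (hypothetical values).
W = [[1.0, 2.0], [3.0, 4.0]]
W_recon = [[1.0, 2.0], [3.0, 4.5]]
print("weight MSE:   ", weight_mse(W, W_recon))
print("output cosine:", output_cosine(W, W_recon))
print("perplexity of a model that gives every token p=0.5:",
      perplexity([math.log(0.5)] * 8))
```

Metrics 1 and 2 only ever see one tensor at a time; metric 3 sees the log-probabilities that come out of the whole stack.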

The trap is error accumulation. Per-tensor metrics look at one layer in isolation. But a 30-layer model stacks 30 small distortions, each feeding the next; a per-layer cosine of 0.99 does not mean a stacked cosine of 0.99. The accumulation is the whole story, and only metric (3) sees it.
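A back-of-the-envelope model of that accumulation, under the strong assumption that every layer's distortion is independent and that per-layer cosines simply multiply through the stack. Real networks have nonlinearities and residual connections, so treat this as an illustration of the trend, not a prediction of any number in this writeup.

```python
def stacked_cosine(per_layer_cosine, n_layers):
    """Toy model: independent per-layer distortions compound multiplicatively."""
    return per_layer_cosine ** n_layers

# 30 layers, as in the example above; the per-layer values are illustrative.
for c in (0.999, 0.99, 0.98):
    print(f"per-layer cosine {c}: stacked over 30 layers ~ {stacked_cosine(c, 30):.3f}")
```

Even in this idealized model, a per-layer cosine of 0.99 leaves roughly 0.74 after 30 layers, which is why a per-tensor score says so little about the stacked model.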

The sweep

Stacked real perplexity on a 2B base model (baseline ppl 1.89):

| method | bits/w | perplexity | ratio vs baseline |
|---|---|---|---|
| scalar 2-bit | 3.00 | 10.14 | 5.36× |
| scalar 3-bit | 4.00 | 1.97 | 1.04× |
| scalar 4-bit | 5.00 | 1.91 | 1.01× |
| VQ d4 K4096 | 3.02 | 2.10 | 1.11× |
| VQ d4 K256 | 2.00 | 22.80 | 12.04× |
| VQ d2 K256 | 4.00 | 2.03 | 1.07× |

(Scalar "N-bit" carries an extra ~1 b/w of block-scale overhead, hence the bits/w column.)
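A minimal sketch of the scalar baseline and of where an extra ~1 b/w can come from. The block size of 16 weights and the 16-bit scale are assumptions chosen because they reproduce a 1 b/w overhead; the sweep's actual block layout is not stated here. The symmetric grid assumes bits >= 2.

```python
def rtn_block(block, bits):
    """Round-to-nearest on one block with a single shared symmetric scale."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 3 for a 3-bit signed grid
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax > 0 else 1.0
    # Clamp to the signed grid [-qmax - 1, qmax].
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in block]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def scalar_bits_per_weight(bits, block_size=16, scale_bits=16):
    """Payload bits plus the amortized cost of one scale per block (assumed layout)."""
    return bits + scale_bits / block_size

block = [0.9, -1.0, 0.1, 1.0]             # toy values
q, scale = rtn_block(block, bits=3)
recon = dequantize(q, scale)
print("codes:", q)
print("reconstruction:", recon)
for bits in (2, 3, 4):
    print(f"scalar {bits}-bit -> {scalar_bits_per_weight(bits):.2f} bits/weight")
```

Each weight is rounded on its own, so the only shared information in a block is the scale. That is the structural difference from a codebook, which spends its bits on joint patterns of several weights.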

Read the table at iso-bitrate:

  • At ~3 b/w, VQ (1.11×) crushes scalar (5.36×). Scalar has to spend a full extra bit to catch up. This is the win: VQ at ~3 b/w ≈ scalar at 4 b/w in quality, at ~25% less size.
  • At ~4 b/w they tie (scalar marginally better, and ~1000× faster to compute). At 4 b/w and above there is no reason to pay VQ's encoding cost.
  • At 2 b/w naïve VQ is broken (12×), exactly as naïve scalar at 2 bits is broken. This is the universal "RTN-at-2-bits destroys models, which is why GPTQ exists" result, and VQ does not escape it for free.
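The size claim is plain arithmetic on the bits/w column. A quick sketch, assuming (for illustration only) that all 2B parameters are stored at the listed rate, with no embeddings or other tensors left at higher precision:

```python
def model_gigabytes(n_params, bits_per_weight):
    """Storage in GB (1e9 bytes) for n_params weights at a given bitrate."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 2e9  # "2B" from the text; assumes every parameter is quantized

vq3 = model_gigabytes(N_PARAMS, 3.02)      # VQ d4 K4096 row
scalar4 = model_gigabytes(N_PARAMS, 4.00)  # scalar 3-bit row (4.00 b/w)
saving = 1 - vq3 / scalar4

print(f"VQ at 3.02 b/w:     {vq3:.3f} GB")
print(f"scalar at 4.00 b/w: {scalar4:.3f} GB")
print(f"saving:             {saving:.1%}")
```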

The mirage we chased and killed

Per-tensor output fidelity made dim=4, K=256 @ 2 bits look like a free lunch: output cosine 0.991, worst-case 0.98, visually indistinguishable from the original at half the bits. Stacked, it is 12× worse perplexity. That single result is the most important thing in this writeup: we nearly shipped a 2-bit config on the strength of a per-tensor number, and it would have produced garbage.

We also tried to rescue sub-1-bit fidelity with a sparse high-precision residual on the worst sub-vectors (the natural "fix the outliers" move). Refuted. VQ error is diffuse (spread thinly across many sub-vectors, not concentrated in a few), so a sparse correction plus its locator bitmap wastes the budget: reaching 0.97 fidelity from a sub-1-bit base costs more total bits than just using dim=4, K=256 @ 2 bits. The cheap-and-good permutation and the expensive-and-good one are never the same one. (The identical trap sinks "reorder weights to compress" schemes.)
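The budget problem is easy to see with a rough accounting sketch. Everything here is hypothetical: the one-bit-per-sub-vector locator bitmap, the 8-bit residual, and the 15% correction rate are illustrative choices, not the measured configuration from vq_residual_lever.py.

```python
def residual_bits_per_weight(base_bits, dim, fraction_corrected, residual_bits):
    """Total bitrate of a base VQ plus a sparse high-precision residual.

    Assumed layout: one locator bit per sub-vector (1/dim bits per weight),
    plus residual_bits per weight for the corrected fraction of sub-vectors.
    """
    bitmap = 1.0 / dim
    correction = fraction_corrected * residual_bits
    return base_bits + bitmap + correction

# Base: dim=16, K=4096 index bits (12/16). Correct 15% of sub-vectors at 8 bits.
total = residual_bits_per_weight(base_bits=0.75, dim=16,
                                 fraction_corrected=0.15, residual_bits=8)
print(f"total: {total:.4f} bits/weight")
```

Under these assumptions, correcting only 15% of sub-vectors already pushes the total past 2 b/w. When the error is diffuse, a small corrected fraction does not buy much fidelity, and a large one is no longer cheap.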

Where the real 2-bit lives

Getting to a usable 2 bits is not impossible; it just isn't free. It requires calibration data, per-channel scales, and GPTQ-style error feedback, which compensates for each weight's quantization error by adjusting the weights that have not been quantized yet. That is essentially the machinery behind AQLM and QuIP#. The honest framing: the 3-bit VQ result stands on its own as a solid, shippable compression win; the 2-bit result is a research program, not a config flag.

Practical takeaways

  1. If you quantize, evaluate on stacked real perplexity. Per-tensor MSE and single-tensor cosine will happily approve a model that generates noise.
  2. 3-bit VQ is a genuine ~25% saving over 4-bit scalar at matched quality. If you are RAM-bound and already at Q4, VQ-3bit is the next real step down.
  3. Don't believe sub-2-bit claims that aren't backed by stacked perplexity and an error-feedback method. The bits/weight number is the easy part; keeping the model alive at that bitrate is the hard part.

Methodology and code are public. Reconstruction, packing (12-bit index bit-packing → true 3 b/w on disk) and the stacked-perplexity harness are in the repository. Model training details are out of scope here; this is a study of weight quantization on public base checkpoints.
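As an illustration of what 12-bit index bit-packing means, here is one possible layout: two 12-bit indices share three bytes. The byte order and the function names are assumptions for this sketch; the on-disk layout used by vq_pack.py may differ.

```python
def pack12(indices):
    """Pack 12-bit integers two at a time into 3 bytes (big-endian nibble layout)."""
    padded = list(indices)
    if len(padded) % 2:
        padded.append(0)                  # pad odd-length input with a dummy index
    out = bytearray()
    for a, b in zip(padded[0::2], padded[1::2]):
        if not (0 <= a < 4096 and 0 <= b < 4096):
            raise ValueError("index does not fit in 12 bits")
        out += bytes([a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF])
    return bytes(out)

def unpack12(data, count):
    """Inverse of pack12; count drops the padding index if one was added."""
    out = []
    for i in range(0, len(data), 3):
        b0, b1, b2 = data[i:i + 3]
        out.append((b0 << 4) | (b1 >> 4))
        out.append(((b1 & 0xF) << 8) | b2)
    return out[:count]

indices = [0, 4095, 1, 2748, 17]          # toy codebook indices, K = 4096
packed = pack12(indices)
restored = unpack12(packed, len(indices))
print(len(packed), "bytes for", len(indices), "indices; round-trip ok:",
      restored == indices)
```

With dim=4, each 12-bit index covers four weights, so the packed stream costs 12/4 = 3 bits per weight before codebook and scale overhead. Storing the same indices as 16-bit integers would cost 4 b/w and erase the saving.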


Repository contents (code)

This repo is the method + code, reproducible on any public base checkpoint. No model weights are shipped here (the on-disk .vqz format is non-standard; a standard-format model card is a separate effort).

| file | what it does |
|---|---|
| vq_sweep.py | per-tensor VQ bitrate sweep + output fidelity (the metric that turned out to lie) |
| vq_stacked_sweep.py | the definitive result: loads the model once, restores+quantizes per config, measures real stacked perplexity |
| vq_stacked_ppl.py | single-config stacked perplexity harness |
| vq_residual_lever.py | sub-1-bit residual lever, refuted (VQ error is diffuse, sparse correction wastes the budget) |
| vq_pack.py | produces a real on-disk artifact (~3 b/w) via 12-bit index bit-packing; round-trip verified |
| vq_load.py | reconstructs a model from the packed artifact and generates text to verify coherence |
| vq_compress.py | core VQ encode/decode primitives |
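For readers who want the shape of the core operation without opening the repo, here is a toy version of VQ encode/decode using plain k-means on a flat list of weights. It is not the code in vq_compress.py: the function names, the random initialization, the fixed iteration count, and the tiny K=16 codebook are all assumptions made to keep the sketch short and dependency-free.

```python
import random

def split_subvectors(weights, dim):
    """Chop a flat weight list into consecutive sub-vectors of length dim."""
    assert len(weights) % dim == 0
    return [weights[i:i + dim] for i in range(0, len(weights), dim)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(codebook, v):
    """Index of the centroid closest to v in squared Euclidean distance."""
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k], v))

def fit_codebook(vectors, K, iters=10, seed=0):
    """Plain Lloyd's k-means; centroids start as K randomly sampled sub-vectors."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, K)]
    for _ in range(iters):
        buckets = [[] for _ in range(K)]
        for v in vectors:
            buckets[nearest(codebook, v)].append(v)
        for k, members in enumerate(buckets):
            if members:                   # keep the old centroid if a cluster is empty
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

def encode(vectors, codebook):
    return [nearest(codebook, v) for v in vectors]

def decode(indices, codebook):
    return [x for i in indices for x in codebook[i]]

# Toy "tensor": 1024 Gaussian weights, dim=4, K=16 (2^4 -> 4/4 = 1 index bit/weight).
rng = random.Random(1)
weights = [rng.gauss(0.0, 1.0) for _ in range(1024)]
vectors = split_subvectors(weights, dim=4)
codebook = fit_codebook(vectors, K=16)
indices = encode(vectors, codebook)
recon = decode(indices, codebook)
mse = sum((w - r) ** 2 for w, r in zip(weights, recon)) / len(weights)
print(f"{len(indices)} indices, weight MSE {mse:.4f}")
```

The stored artifact is just the index list plus the codebook. Note that the number printed at the end is weight MSE, the metric this writeup argues you should not trust on its own.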

Reproduce the headline table

pip install -r requirements.txt
python vq_stacked_sweep.py            # stacked real perplexity, VQ vs scalar, per config

The 3-bit VQ win (≈25% smaller than 4-bit scalar at matched perplexity) and the "per-tensor metrics lie" result both fall out of this single script.

