Vector Quantization for LLM weights: a real 25% win at 3 bits — and why every cheaper number is a lie

An honest sweep of VQ vs scalar quantization, validated the only way that counts: stacked real-data perplexity.

📊 Companion agentic-LLM leaderboard, same "measure honestly" philosophy: rab.utopiaia.com — RAB, the Real-world Agentic Benchmark.

TL;DR

At ~3 bits/weight, vector quantization (VQ) keeps a model essentially intact (perplexity 1.11× baseline) where ordinary scalar quantization at the same bitrate is destroyed (5.36×). Scalar needs a full extra bit (4 b/w) to match VQ's 3-bit quality — i.e. VQ is ~25% smaller than Q4 at equal quality.
Below 3 bits, naïve VQ collapses (2-bit VQ → 12× perplexity). The sub-1-bit and 2-bit dreams are not reachable without calibration + error feedback — at which point you have simply reimplemented AQLM / QuIP#.
The headline methodological result: weight-MSE and single-tensor output-fidelity both lie. A 2-bit VQ config that looked near-lossless on per-tensor cosine (0.991) was 12× worse on real perplexity. Only stacked, real-data perplexity predicts model quality. If you are evaluating a quantizer on per-tensor error, you are measuring the wrong thing.

The idea

Scalar quantization rounds each weight independently to a low-bit grid (RTN), with a shared scale per block. Vector quantization instead groups each weight matrix into sub-vectors of length dim, clusters them into K shared centroids (a codebook), and stores one index per sub-vector. The cost is:

bits/weight = log2(K) / dim     (+ a small codebook + scale overhead)

So dim=4, K=4096 → 12/4 = 3 bits/weight; dim=4, K=256 → 8/4 = 2 bits; dim=16, K=4096 → 12/16 = 0.75 bits, all before overhead. The promise: a codebook captures correlations between adjacent weights that scalar rounding throws away, so the same bit budget should buy more fidelity.
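The bitrate arithmetic is easy to check by hand, but a small helper makes the overhead term explicit. This is a sketch, not the repository's accounting: the function name, the 16-bit codebook entries and the 4096×4096 matrix in the usage note are all assumptions chosen for illustration.

```python
import math

def vq_bits_per_weight(dim, K, n_weights=None, codebook_bits=16):
    """Index cost log2(K)/dim, plus the codebook amortized over n_weights if given.

    codebook_bits (precision of each stored centroid entry) is an assumption;
    the source only says the codebook overhead is "small".
    """
    index_bits = math.log2(K) / dim
    if n_weights is None:
        return index_bits
    codebook_total = K * dim * codebook_bits  # K centroids, dim entries each
    return index_bits + codebook_total / n_weights

for dim, K in [(4, 4096), (4, 256), (16, 4096), (2, 256)]:
    bits = vq_bits_per_weight(dim, K)
    print(f"dim={dim:2d} K={K:4d} -> {bits:.2f} bits/weight (indices only)")
```

Passing a hypothetical `n_weights=4096 * 4096` for one matrix with its own codebook adds about 0.016 b/w to the dim=4, K=4096 case. The overhead grows with K × dim, which is why larger sub-vectors are not automatically cheaper once the codebook is counted.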

That promise is real, but only in a narrow band, and only if you measure it correctly.
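For readers who have not implemented VQ before, the clustering step described above can be sketched in a few lines of NumPy. This is a minimal illustration with plain Lloyd k-means, not the code in `vq_compress.py`; the function names, the iteration count and the random-sample initialization are assumptions.

```python
import numpy as np

def assign(vectors, codebook):
    # Nearest centroid by squared distance. ||v||^2 is constant per row,
    # so only -2 v.c + ||c||^2 needs to be compared.
    scores = -2.0 * vectors @ codebook.T + (codebook ** 2).sum(axis=1)
    return scores.argmin(axis=1)

def kmeans(vectors, K, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids from K distinct sub-vectors (an arbitrary choice).
    codebook = vectors[rng.choice(len(vectors), size=K, replace=False)].copy()
    for _ in range(iters):
        idx = assign(vectors, codebook)
        for k in range(K):
            members = vectors[idx == k]
            if len(members):            # leave empty clusters where they are
                codebook[k] = members.mean(axis=0)
    return codebook

def vq_encode(W, dim, K):
    vectors = W.reshape(-1, dim)        # W.size must be divisible by dim
    codebook = kmeans(vectors, K)
    return assign(vectors, codebook), codebook

def vq_decode(idx, codebook, shape):
    return codebook[idx].reshape(shape)

# Toy usage: a 64x64 Gaussian matrix at dim=4, K=16 (1 bit/weight of indices).
W = np.random.default_rng(0).standard_normal((64, 64))
idx, codebook = vq_encode(W, dim=4, K=16)
W_recon = vq_decode(idx, codebook, W.shape)
print("weight MSE:", float(np.mean((W - W_recon) ** 2)))
```

Real weight matrices are not i.i.d. Gaussian, so the toy says nothing about model quality. It only shows what is stored: one index per sub-vector plus the shared codebook.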

Measuring it correctly

We evaluated three ways, in increasing order of honesty:

(1) Weight MSE: reconstruction error of the matrix itself. Cheap and misleading, because it has no idea which errors the forward pass amplifies.
(2) Single-tensor output fidelity: feed random inputs x through W and W_recon, measure output cosine. Better in principle, but with Gaussian inputs it still lied to us (see below).
(3) Stacked real-data perplexity: quantize every layer, run the whole model on real text, measure perplexity. This is the only metric that tracks what a user would actually feel.
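The three metrics above reduce to three small functions. The sketch below assumes weights are stored as (out_features, in_features) and that a model has already produced natural-log probabilities for each target token; the names are illustrative and are not taken from the repository's harness.

```python
import numpy as np

def weight_mse(W, W_recon):
    """Metric 1: mean squared reconstruction error of the matrix itself."""
    return float(np.mean((W - W_recon) ** 2))

def output_cosine(W, W_recon, n_inputs=256, seed=0):
    """Metric 2: mean cosine between W x and W_recon x for Gaussian inputs x."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_inputs, W.shape[1]))
    y, y_hat = x @ W.T, x @ W_recon.T
    num = (y * y_hat).sum(axis=1)
    den = np.linalg.norm(y, axis=1) * np.linalg.norm(y_hat, axis=1)
    return float(np.mean(num / den))

def perplexity(token_log_probs):
    """Metric 3 (final step only): exp of the mean negative log-likelihood.

    token_log_probs must come from the fully quantized model run on real text;
    producing them is the expensive part and is not shown here.
    """
    return float(np.exp(-np.mean(token_log_probs)))
```

As a sanity check on the last function, a model that assigns probability 0.5 to every target token has perplexity 2.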

The trap is error accumulation. Per-tensor metrics look at one layer in isolation. But a 30-layer model stacks 30 small distortions, each feeding the next; a per-layer cosine of 0.99 does not mean a stacked cosine of 0.99. The accumulation is the whole story, and only metric (3) sees it.
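A toy simulation makes the accumulation argument concrete. It is deliberately crude: random linear layers with no nonlinearities, residuals or normalization, and Gaussian noise standing in for quantization error, so it is not a model of the 2B checkpoint. All sizes and the noise level are assumptions, picked so that each layer in isolation shows an output cosine near 0.99.

```python
import numpy as np

def cosine(a, b):
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))

def stacked_vs_per_layer(n_layers=30, width=256, rel_noise=0.14,
                         n_inputs=64, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_inputs, width))
    clean, noisy = x, x
    per_layer = []
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        E = rel_noise * rng.standard_normal((width, width)) / np.sqrt(width)
        # Per-layer view: clean input, perturbed weights, one layer in isolation.
        per_layer.append(cosine(clean @ W.T, clean @ (W + E).T))
        # Stacked view: the perturbed path feeds its own output forward.
        clean = clean @ W.T
        noisy = noisy @ (W + E).T
    return float(np.mean(per_layer)), cosine(clean, noisy)

per_layer, stacked = stacked_vs_per_layer()
print(f"mean per-layer cosine: {per_layer:.3f}")
print(f"stacked cosine after 30 layers: {stacked:.3f}")
```

The per-layer number stays high at every depth while the end-to-end number drops well below it. A real transformer accumulates error differently, which is exactly why the measurement has to be done on the real model.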

The sweep

Stacked real perplexity on a 2B base model (baseline ppl 1.89):

method	bits/w	perplexity	ratio vs baseline
scalar 2-bit	3.00	10.14	5.36×
scalar 3-bit	4.00	1.97	1.04×
scalar 4-bit	5.00	1.91	1.01×
VQ d4 K4096	3.02	2.10	1.11×
VQ d4 K256	2.00	22.80	12.04×
VQ d2 K256	4.00	2.03	1.07×

(Scalar "N-bit" carries an extra ~1 b/w of block-scale overhead, hence the bits/w column.)
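To make the overhead note concrete, here is a minimal round-to-nearest block quantizer. The block size of 16 and the 16-bit scale are assumptions: they are one combination that yields exactly 1 b/w of overhead, and the source does not state which block layout was actually used.

```python
import numpy as np

def rtn_block_quantize(W, bits, block=16):
    """Round-to-nearest on a symmetric grid with one shared scale per block."""
    flat = W.reshape(-1, block)                 # W.size must be divisible by block
    qmax = 2 ** (bits - 1) - 1                  # e.g. 3 for a 3-bit grid
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                     # avoid dividing by zero on all-zero blocks
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(W.shape)

def scalar_bits_per_weight(bits, block=16, scale_bits=16):
    """Payload bits plus the per-block scale amortized over the block."""
    return bits + scale_bits / block

for bits in (2, 3, 4):
    print(f"scalar {bits}-bit -> {scalar_bits_per_weight(bits):.2f} bits/weight")
```

Under these assumptions the helper reproduces the bits/w column for the three scalar rows (3.00, 4.00, 5.00).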

Read the table at iso-bitrate:

At ~3 b/w, VQ (1.11×) crushes scalar (5.36×). Scalar has to spend a full extra bit to catch up. This is the win: VQ at 3.02 b/w lands close to scalar at 4.00 b/w (1.11× vs 1.04×, the "scalar 3-bit" row) at about 25% less size.
At ~4 b/w they tie (scalar 1.04× vs VQ 1.07×, so scalar is marginally better, and ~1000× faster to compute). At 4 b/w and above there is no reason to pay VQ's encoding cost.
At 2 b/w naïve VQ is broken (12×) — exactly like naïve scalar at 2 bits is broken. This is the universal "RTN-at-2-bits destroys models, which is why GPTQ exists" result, and VQ does not escape it for free.
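The size claim follows directly from the bits/w column. The snippet below redoes that arithmetic and converts it to storage, under the simplifying assumption that exactly 2e9 parameters are quantized; the "2B" figure is nominal and embeddings or norms may be stored differently.

```python
def relative_saving(bits_small, bits_large):
    """Fraction of storage saved by going from bits_large to bits_small."""
    return 1.0 - bits_small / bits_large

def weights_gigabytes(n_params, bits_per_weight):
    """Storage for the quantized weights only, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

saving = relative_saving(3.02, 4.00)   # VQ d4 K4096 vs scalar at 4.00 b/w
print(f"saving: {saving:.1%}")
for label, bits in [("scalar at 4.00 b/w", 4.00), ("VQ d4 K4096", 3.02)]:
    print(f"{label}: {weights_gigabytes(2e9, bits):.3f} GB for 2e9 weights")
```

With the table's 3.02 b/w the saving is slightly under the headline 25%; the round number corresponds to an index-only 3.00 b/w.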

The mirage we chased and killed

Per-tensor output fidelity made dim=4, K=256 @ 2 bits look like a free lunch: output cosine 0.991, worst-case 0.98 — visually indistinguishable from the original at half the bits. Stacked, it is 12× worse perplexity. That single result is the most important thing in this writeup: we nearly shipped a 2-bit config on the strength of a per-tensor number, and it would have produced garbage.

We also tried to rescue sub-1-bit fidelity with a sparse high-precision residual on the worst sub-vectors (the natural "fix the outliers" move). Refuted. VQ error is diffuse, spread thinly across many sub-vectors rather than concentrated in a few, so a sparse correction plus its locator bitmap wastes the budget: lifting a sub-1-bit base to 0.97 fidelity costs more total bits than just using dim=4, K=256 @ 2 bits. The configuration that is cheap and the one that is good are never the same one. (The identical trap sinks "reorder weights to compress" schemes.)
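The budget problem can be seen with back-of-the-envelope accounting. The sketch below assumes a one-bit-per-sub-vector locator bitmap and 8-bit residuals on every weight of a corrected sub-vector; both are illustrative choices, not the layout used in `vq_residual_lever.py`.

```python
def sparse_residual_bits_per_weight(base_bits, dim, frac_corrected, residual_bits):
    """Total bits/weight for a VQ base plus a sparse high-precision residual."""
    bitmap = 1.0 / dim                           # one flag bit per sub-vector
    residual = frac_corrected * residual_bits    # paid only on corrected sub-vectors
    return base_bits + bitmap + residual

def max_fraction_within(budget_bits, base_bits, dim, residual_bits):
    """Largest fraction of sub-vectors that can be corrected within the budget."""
    spare = budget_bits - base_bits - 1.0 / dim
    return max(0.0, spare / residual_bits)

# How many sub-vectors can a 0.77 b/w, dim=16 base correct before it costs 2 b/w?
frac = max_fraction_within(budget_bits=2.0, base_bits=0.77, dim=16, residual_bits=8)
print(f"correctable fraction at a 2 b/w budget: {frac:.1%}")
```

Under these assumptions only about one sub-vector in seven can be corrected before the scheme costs as much as plain 2-bit VQ. When the error is spread across nearly all sub-vectors, that leaves most of it untouched.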

Where the real 2-bit lives

Getting to a usable 2 bits is not impossible; it just isn't free. It requires calibration data, per-channel scales, and GPTQ-style error feedback, which quantizes a layer column by column and folds each column's rounding error into that layer's not-yet-quantized weights so that the layer's output on calibration data stays close to the original. That is precisely the machinery of AQLM and QuIP#. The honest framing: the 3-bit VQ result stands on its own as a solid, shippable compression win; the 2-bit result is a research program, not a config flag.
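For orientation, this is roughly what error feedback looks like on a single linear layer. It is a simplified GPTQ-style sketch with a scalar grid, not the reference GPTQ implementation and not anything from this repository: there is no blocking, no column reordering and no codebook, and the damping value, the shapes and the synthetic correlated calibration inputs are all assumptions.

```python
import numpy as np

def make_grid_quantizer(W, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax           # one scale per output row
    def quantize_column(col):
        return np.clip(np.round(col / scale), -qmax - 1, qmax) * scale
    return quantize_column

def rtn(W, bits):
    quant = make_grid_quantizer(W, bits)
    return np.stack([quant(W[:, j]) for j in range(W.shape[1])], axis=1)

def gptq_like(W, X, bits, damp=0.01):
    """Quantize columns left to right, pushing each column's error into the
    columns of the same layer that have not been quantized yet."""
    quant = make_grid_quantizer(W, bits)           # same grid as plain RTN
    H = X.T @ X                                    # proxy Hessian from calibration inputs
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    U = np.linalg.cholesky(np.linalg.inv(H)).T     # upper factor: inv(H) = U.T @ U
    W = W.copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quant(W[:, j])
        err = (W[:, j] - Q[:, j]) / U[j, j]
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])  # compensate in remaining columns
    return Q

rng = np.random.default_rng(0)
n, d_in, d_out = 512, 32, 16
# A shared component makes the input features correlated, which is what
# error feedback exploits.
X = rng.standard_normal((n, d_in)) + rng.standard_normal((n, 1))
W = rng.standard_normal((d_out, d_in))

def output_error(Q):
    return float(np.sum((X @ (W - Q).T) ** 2))

rtn_err = output_error(rtn(W, 3))
gptq_err = output_error(gptq_like(W, X, 3))
print(f"output error, RTN: {rtn_err:.1f}  with error feedback: {gptq_err:.1f}")
```

Note that the quantity being minimized is the layer's output error on calibration inputs, not weight MSE. That is the same lesson as the rest of this writeup, applied inside the quantizer.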

Practical takeaways

If you quantize, evaluate on stacked real perplexity. Per-tensor MSE and single-tensor cosine will happily approve a model that generates noise.
VQ at ~3 b/w is a genuine ~25% saving over scalar at 4 b/w at near-matched quality (1.11× vs 1.04× baseline perplexity). If you are RAM-bound and already at Q4, VQ-3bit is the next real step down.
Don't believe sub-2-bit claims that aren't backed by stacked perplexity and an error-feedback method. The bits/weight number is the easy part; keeping the model alive at that bitrate is the hard part.

Methodology and code are public. Reconstruction, packing (12-bit index bit-packing → true 3 b/w on disk) and the stacked-perplexity harness are in the repository. Model training details are out of scope here — this is a study of weight quantization on public base checkpoints.
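For readers wondering what 12-bit index packing involves, here is one way to do it: two 12-bit indices share three bytes. This is an illustrative layout, not the .vqz format or the code in `vq_pack.py`; the function names and the byte order are assumptions.

```python
import numpy as np

def pack12(indices):
    """Pack integers in [0, 4096) into bytes, two indices per three bytes."""
    idx = np.asarray(indices).astype(np.uint16)
    if idx.size % 2:                               # pad to an even count
        idx = np.concatenate([idx, np.zeros(1, dtype=np.uint16)])
    a = idx[0::2].astype(np.uint32)
    b = idx[1::2].astype(np.uint32)
    out = np.empty((a.size, 3), dtype=np.uint8)
    out[:, 0] = (a & 0xFF).astype(np.uint8)                           # low 8 bits of a
    out[:, 1] = (((a >> 8) & 0x0F) | ((b & 0x0F) << 4)).astype(np.uint8)
    out[:, 2] = ((b >> 4) & 0xFF).astype(np.uint8)                    # high 8 bits of b
    return out.reshape(-1)

def unpack12(packed, count):
    """Inverse of pack12; count is the original number of indices."""
    t = np.asarray(packed, dtype=np.uint8).reshape(-1, 3).astype(np.uint16)
    a = t[:, 0] | ((t[:, 1] & 0x0F) << 8)
    b = (t[:, 1] >> 4) | (t[:, 2] << 4)
    out = np.empty(t.shape[0] * 2, dtype=np.uint16)
    out[0::2], out[1::2] = a, b
    return out[:count]

indices = np.random.default_rng(0).integers(0, 4096, size=1000)
packed = pack12(indices)
assert np.array_equal(unpack12(packed, indices.size), indices)
print(f"{packed.size} bytes for {indices.size} indices "
      f"({8 * packed.size / indices.size:.1f} bits per index)")
```

At dim=4, 12 bits per index is 3 bits per weight, which is where the on-disk 3 b/w figure comes from once indices are no longer padded to 16 bits.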

Repository contents (code)

This repo is the method + code, reproducible on any public base checkpoint. No model weights are shipped here (the on-disk .vqz format is non-standard; a standard-format model card is a separate effort).

file	what it does
`vq_sweep.py`	per-tensor VQ bitrate sweep + output fidelity (the metric that turned out to lie)
`vq_stacked_sweep.py`	the definitive result: loads the model once, restores+quantizes per config, measures real stacked perplexity
`vq_stacked_ppl.py`	single-config stacked perplexity harness
`vq_residual_lever.py`	sub-1-bit residual lever — refuted (VQ error is diffuse, sparse correction wastes the budget)
`vq_pack.py`	produces a real on-disk artifact (~3 b/w) via 12-bit index bit-packing; round-trip verified
`vq_load.py`	reconstructs a model from the packed artifact and generates text to verify coherence
`vq_compress.py`	core VQ encode/decode primitives

Reproduce the headline table

pip install -r requirements.txt
python vq_stacked_sweep.py            # stacked real perplexity, VQ vs scalar, per config

The 3-bit VQ win (≈25% smaller than scalar at 4 b/w, at near-matched perplexity) and the "per-tensor metrics lie" result both fall out of this single script.

