Vector Quantization for LLM weights: a real 25% win at 3 bits, and why every cheaper number is a lie
An honest sweep of VQ vs scalar quantization, validated the only way that counts: stacked real-data perplexity.
Companion agentic-LLM leaderboard, same "measure honestly" philosophy: rab.utopiaia.com (RAB, the Real-world Agentic Benchmark).
TL;DR
- At ~3 bits/weight, vector quantization (VQ) keeps a model essentially intact (perplexity 1.11× baseline), whereas ordinary scalar quantization at the same bitrate destroys it (5.36×). Scalar needs a full extra bit (4 b/w) to match VQ's 3-bit quality; i.e. VQ is ~25% smaller than Q4 at equal quality.
- Below 3 bits, naïve VQ collapses (2-bit VQ gives about 12× perplexity). The sub-1-bit and 2-bit dreams are not reachable without calibration + error feedback, at which point you have simply reimplemented AQLM / QuIP#.
- The headline methodological result: weight-MSE and single-tensor output-fidelity both lie. A 2-bit VQ config that looked near-lossless on per-tensor cosine (0.991) was 12× worse on real perplexity. Only stacked, real-data perplexity predicts model quality. If you are evaluating a quantizer on per-tensor error, you are measuring the wrong thing.
The idea
Scalar quantization rounds each weight independently to a low-bit grid (RTN), with a shared scale per block. Vector quantization instead groups each weight matrix into sub-vectors of length dim, clusters them into K shared centroids (a codebook), and stores one index per sub-vector. The cost is:
bits/weight = log2(K) / dim (+ a small codebook + scale overhead)
So dim=4, K=4096 → 12/4 = 3 bits/weight; dim=4, K=256 → 8/4 = 2 bits; dim=16, K=4096 → about 0.77 bits (12/16 = 0.75 for the indices, plus overhead). The promise: a codebook captures correlations between adjacent weights that scalar rounding throws away, so the same bit budget should buy more fidelity.
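The bitrate arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `vq_bits_per_weight` and the fp16 codebook entries (`codebook_bits=16`) are assumptions made for illustration, not values taken from the repository.

```python
import math

def vq_bits_per_weight(dim, K, n_weights=None, codebook_bits=16):
    """Index cost log2(K)/dim, plus the codebook amortized over n_weights if given."""
    index_bits = math.log2(K) / dim
    if n_weights is None:
        return index_bits
    # Codebook storage: K centroids x dim entries x codebook_bits each
    # (fp16 entries are an assumption), shared by every weight that uses it.
    overhead = K * dim * codebook_bits / n_weights
    return index_bits + overhead

for dim, K in [(4, 4096), (4, 256), (16, 4096)]:
    print(f"dim={dim} K={K}: {vq_bits_per_weight(dim, K):.2f} index bits/weight")
```

The index term alone gives 3.00, 2.00 and 0.75 bits/weight for the three configurations; the codebook and scale overhead is added on top and depends on how many weights share one codebook.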
That promise is real, but only in a narrow band, and only if you measure it correctly.
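To make the encode/decode idea concrete, here is a toy sketch in plain Python: it splits a flat weight list into sub-vectors, fits a codebook with a few rounds of Lloyd's k-means, and stores one index per sub-vector. All names (`fit_codebook`, `vq_encode`, `vq_decode`) are hypothetical and the data is synthetic Gaussian noise; the repository's `vq_compress.py` may be organized quite differently.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, codebook):
    # Index of the centroid closest to vec.
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))

def fit_codebook(vectors, K, iters=10, seed=0):
    """Plain Lloyd's k-means, initialized from K randomly chosen sub-vectors."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, K)]
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

def vq_encode(weights, dim, codebook):
    subvecs = [weights[i:i + dim] for i in range(0, len(weights), dim)]
    return [nearest(v, codebook) for v in subvecs]

def vq_decode(indices, codebook):
    return [w for idx in indices for w in codebook[idx]]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
dim = 4
weights = [rng.gauss(0.0, 1.0) for _ in range(256 * dim)]  # synthetic "weights"
subvecs = [weights[i:i + dim] for i in range(0, len(weights), dim)]

errors = {}
for K in (1, 16):
    codebook = fit_codebook(subvecs, K)
    indices = vq_encode(weights, dim, codebook)
    recon = vq_decode(indices, codebook)
    errors[K] = mse(weights, recon)
    print(f"K={K}: reconstruction MSE {errors[K]:.3f}")
```

A larger codebook lowers the reconstruction error here, but, as the rest of this writeup argues, a lower weight MSE on its own says little about model quality.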
Measuring it correctly
We evaluated three ways, in increasing order of honesty:
1. Weight MSE: reconstruction error of the matrix itself. Cheap, and misleading: it has no idea which errors the forward pass amplifies.
2. Single-tensor output fidelity: feed random inputs x through W and W_recon, measure output cosine. Better in principle, but with Gaussian inputs it still lied to us (see below).
3. Stacked real-data perplexity: quantize every layer, run the whole model on real text, measure perplexity. This is the only metric that tracks what a user would actually feel.
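The three metrics can be written down in a few lines. The sketch below uses plain Python lists and hypothetical helper names (`weight_mse`, `output_cosine`, `perplexity`); it assumes the usual definition of perplexity as the exponential of the mean negative log-likelihood per token, and it is not the repository's harness.

```python
import math

def weight_mse(W, W_recon):
    """Metric 1: mean squared error between the two weight matrices."""
    n = sum(len(row) for row in W)
    return sum((a - b) ** 2
               for row, row_r in zip(W, W_recon)
               for a, b in zip(row, row_r)) / n

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def output_cosine(W, W_recon, x):
    """Metric 2: cosine between the outputs of one layer for the same input."""
    return cosine(matvec(W, x), matvec(W_recon, x))

def perplexity(token_logprobs):
    """Metric 3: exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

W = [[1.0, 2.0], [3.0, 4.0]]
W_recon = [[1.0, 2.0], [3.0, 5.0]]   # one entry is off by 1.0
print(weight_mse(W, W_recon))                   # nonzero weight error
print(output_cosine(W, W_recon, [1.0, 0.0]))    # this input never touches the bad entry
print(perplexity([math.log(0.5)] * 4))          # every token predicted with p = 0.5
```

The second print is the point of the toy matrices: an input that does not excite the damaged column reports a cosine of essentially 1.0, so a per-tensor fidelity number is only as informative as the inputs used to probe it.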
The trap is error accumulation. Per-tensor metrics look at one layer in isolation. But a 30-layer model stacks 30 small distortions, each feeding the next; a per-layer cosine of 0.99 does not mean a stacked cosine of 0.99. The accumulation is the whole story, and only metric (3) sees it.
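The accumulation effect can be reproduced on a toy problem. The sketch below is not a quantizer: it stacks 30 random linear layers and adds small synthetic Gaussian noise to each weight matrix as a stand-in for quantization error. The width, depth, noise level and the absence of nonlinearities are all assumptions chosen only to make the effect visible.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

rng = random.Random(0)
d, n_layers, eps = 64, 30, 0.1   # width, depth, relative noise per layer (all assumed)
scale = 1.0 / math.sqrt(d)       # keeps activation norms roughly constant

clean_layers = [[[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(d)]
                for _ in range(n_layers)]
# Synthetic perturbation standing in for quantization error.
noisy_layers = [[[w + rng.gauss(0.0, eps * scale) for w in row] for row in W]
                for W in clean_layers]

h_clean = [rng.gauss(0.0, 1.0) for _ in range(d)]
h_noisy = list(h_clean)
per_layer = []
for W, W_q in zip(clean_layers, noisy_layers):
    # Per-tensor view: the same clean input through both versions of this layer.
    per_layer.append(cosine(matvec(W, h_clean), matvec(W_q, h_clean)))
    # Stacked view: the noisy path consumes its own already-distorted activations.
    h_clean, h_noisy = matvec(W, h_clean), matvec(W_q, h_noisy)

stacked = cosine(h_clean, h_noisy)
print(f"worst per-layer cosine: {min(per_layer):.4f}")
print(f"stacked cosine after {n_layers} layers: {stacked:.4f}")
```

Every layer looks fine in isolation, yet the end-to-end cosine is clearly lower, which is the gap that per-tensor metrics cannot see.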
The sweep
Stacked real perplexity on a 2B base model (baseline ppl 1.89):
| method | bits/w | perplexity | ratio vs baseline |
|---|---|---|---|
| scalar 2-bit | 3.00 | 10.14 | 5.36× |
| scalar 3-bit | 4.00 | 1.97 | 1.04× |
| scalar 4-bit | 5.00 | 1.91 | 1.01× |
| VQ d4 K4096 | 3.02 | 2.10 | 1.11× |
| VQ d4 K256 | 2.00 | 22.80 | 12.04× |
| VQ d2 K256 | 4.00 | 2.03 | 1.07× |
(Scalar "N-bit" carries an extra ~1 b/w of block-scale overhead, hence the bits/w column.)
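For reference, block-wise round-to-nearest with a shared scale can be sketched as follows. The function names are hypothetical, and the particular combination of an fp16 scale per 16 weights is only one assumed layout that would produce the ~1 b/w overhead quoted above; the block size actually used in the sweep is not stated here.

```python
def rtn_quantize_block(block, bits):
    """Symmetric round-to-nearest with one shared scale for the whole block."""
    qmax = 2 ** (bits - 1) - 1        # e.g. 3 bits -> integer codes in [-4, 3]
    qmin = -(2 ** (bits - 1))
    absmax = max(abs(w) for w in block)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in block]
    return codes, scale

def rtn_dequantize_block(codes, scale):
    return [c * scale for c in codes]

def scalar_bits_per_weight(bits, block_size=16, scale_bits=16):
    """Nominal bits plus the per-block scale amortized over the block (assumed layout)."""
    return bits + scale_bits / block_size

block = [0.9, -0.3, 0.1, -1.2]
codes, scale = rtn_quantize_block(block, bits=3)
recon = rtn_dequantize_block(codes, scale)
print(codes, recon)
print(scalar_bits_per_weight(2), scalar_bits_per_weight(3), scalar_bits_per_weight(4))
```

Under that assumed layout the nominal 2-, 3- and 4-bit settings come out at 3.0, 4.0 and 5.0 bits/weight, matching the bits/w column.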
Read the table at iso-bitrate:
- At ~3 b/w, VQ (1.11×) crushes scalar (5.36×). Scalar has to spend a full extra bit to catch up. This is the win: VQ at ~3 b/w ≈ scalar at 4 b/w in quality, at 25% less size.
- At ~4 b/w they tie (scalar marginally better, and ~1000× faster to compute). Above 3 bits there is no reason to pay VQ's encoding cost.
- At 2 b/w naïve VQ is broken (12×), exactly as naïve scalar at 2 bits is broken. This is the universal "RTN-at-2-bits destroys models, which is why GPTQ exists" result, and VQ does not escape it for free.
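The size saving at matched quality follows directly from the bits/w column. A quick sketch, assuming "2B" means exactly 2e9 quantized weights and ignoring anything kept at higher precision (both are assumptions, since the exact parameter count is not given here):

```python
def quantized_size_mb(n_params, bits_per_weight):
    """Size of the quantized weights alone, in megabytes (10**6 bytes)."""
    return n_params * bits_per_weight / 8 / 1e6

N_PARAMS = 2_000_000_000  # assumption: "2B" taken as exactly 2e9 weights

vq3 = quantized_size_mb(N_PARAMS, 3.02)      # "VQ d4 K4096" row
scalar4 = quantized_size_mb(N_PARAMS, 4.00)  # "scalar 3-bit" row (4.00 b/w with scales)
saving = 1 - vq3 / scalar4
print(f"VQ: {vq3:.0f} MB, scalar: {scalar4:.0f} MB, saving: {saving:.1%}")
```

With the measured 3.02 b/w the saving is just under 25%, which is where the "~25%" figure comes from.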
The mirage we chased and killed
Per-tensor output fidelity made dim=4, K=256 @ 2 bits look like a free lunch: output cosine 0.991, worst-case 0.98, visually indistinguishable from the original at half the bits. Stacked, it is 12× worse perplexity. That single result is the most important thing in this writeup: we nearly shipped a 2-bit config on the strength of a per-tensor number, and it would have produced garbage.
We also tried to rescue sub-1-bit fidelity with a sparse high-precision residual on the worst sub-vectors (the natural "fix the outliers" move). Refuted. VQ error is diffuse (spread thinly across many sub-vectors, not concentrated in a few), so a sparse correction plus its locator bitmap wastes the budget: reaching 0.97 fidelity sub-1-bit costs more total bits than just using dim=4, K=256 @ 2 bits. The cheap-and-good permutation and the expensive-and-good one are never the same one. (The identical trap sinks "reorder weights to compress" schemes.)
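The budget argument can be made concrete with a rough accounting sketch. The storage layout here is an assumption, not the one used in `vq_residual_lever.py`: one locator bit per sub-vector, and flagged sub-vectors stored at 16 bits per weight.

```python
def bits_with_sparse_residual(base_bpw, dim, frac_corrected, residual_bpw=16):
    """Total bits/weight = base VQ + locator bitmap + sparse residual payload."""
    bitmap_bpw = 1.0 / dim                        # one flag bit per sub-vector (assumed)
    payload_bpw = frac_corrected * residual_bpw   # only flagged sub-vectors pay this
    return base_bpw + bitmap_bpw + payload_bpw

def max_fraction_under(budget_bpw, base_bpw, dim, residual_bpw=16):
    """Largest fraction of sub-vectors that can be corrected within a bit budget."""
    return max(0.0, (budget_bpw - base_bpw - 1.0 / dim) / residual_bpw)

# Start from the dim=16, K=4096 configuration (~0.77 b/w) and ask how many
# sub-vectors could be corrected before the total reaches 2 b/w.
frac = max_fraction_under(2.0, base_bpw=0.77, dim=16)
print(f"fraction correctable before reaching 2 b/w: {frac:.1%}")
```

Under these assumptions only about 7% of sub-vectors can be corrected before the scheme costs as much as plain 2-bit VQ, which is a poor trade when the error is spread across most of them.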
Where the real 2-bit lives
Getting to a usable 2 bits is not impossible; it just isn't free. It requires calibration data, per-channel scales, and GPTQ-style error feedback, which compensates for each quantized weight's error by adjusting the weights that have not been quantized yet. That is precisely the machinery of AQLM and QuIP#. The honest framing: the 3-bit VQ result stands on its own as a solid, shippable compression win; the 2-bit result is a research program, not a config flag.
Practical takeaways
- If you quantize, evaluate on stacked real perplexity. Per-tensor MSE and single-tensor cosine will happily approve a model that generates noise.
- 3-bit VQ is a genuine ~25% saving over 4-bit scalar at matched quality. If you are RAM-bound and already at Q4, VQ-3bit is the next real step down.
- Don't believe sub-2-bit claims that aren't backed by stacked perplexity and an error-feedback method. The bits/weight number is the easy part; keeping the model alive at that bitrate is the hard part.
Methodology and code are public. Reconstruction, packing (12-bit index bit-packing → true 3 b/w on disk) and the stacked-perplexity harness are in the repository. Model training details are out of scope here; this is a study of weight quantization on public base checkpoints.
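As an illustration of how 12-bit indices reach a true 3 b/w on disk with dim=4, two indices can be packed into three bytes. The byte layout and the names `pack12` / `unpack12` below are assumptions; the layout used by `vq_pack.py` may differ.

```python
def pack12(indices):
    """Pack 12-bit integers two at a time into 3 bytes (assumed big-endian layout)."""
    indices = list(indices)
    if len(indices) % 2:
        indices.append(0)                      # pad to an even count
    out = bytearray()
    for a, b in zip(indices[0::2], indices[1::2]):
        if not (0 <= a < 4096 and 0 <= b < 4096):
            raise ValueError("index does not fit in 12 bits")
        out += bytes([a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF])
    return bytes(out)

def unpack12(data, count):
    """Inverse of pack12; count drops the padding entry if one was added."""
    out = []
    for i in range(0, len(data), 3):
        b0, b1, b2 = data[i], data[i + 1], data[i + 2]
        out.append((b0 << 4) | (b1 >> 4))
        out.append(((b1 & 0xF) << 8) | b2)
    return out[:count]

indices = [0, 4095, 1234, 7]
packed = pack12(indices)
assert unpack12(packed, len(indices)) == indices   # round trip
dim = 4
print(len(packed) * 8 / (len(indices) * dim), "bits/weight for the indices")
```

Four indices cover 16 weights in 6 bytes, i.e. 48 bits for 16 weights, which is the 3 bits/weight index cost before codebook and scale overhead.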
Repository contents (code)
This repo is the method + code, reproducible on any public base checkpoint. No model weights are shipped here (the on-disk .vqz format is non-standard; a standard-format model card is a separate effort).
| file | what it does |
|---|---|
| `vq_sweep.py` | per-tensor VQ bitrate sweep + output fidelity (the metric that turned out to lie) |
| `vq_stacked_sweep.py` | the definitive result: loads the model once, restores+quantizes per config, measures real stacked perplexity |
| `vq_stacked_ppl.py` | single-config stacked perplexity harness |
| `vq_residual_lever.py` | sub-1-bit residual lever: refuted (VQ error is diffuse, sparse correction wastes the budget) |
| `vq_pack.py` | produces a real on-disk artifact (~3 b/w) via 12-bit index bit-packing; round-trip verified |
| `vq_load.py` | reconstructs a model from the packed artifact and generates text to verify coherence |
| `vq_compress.py` | core VQ encode/decode primitives |
Reproduce the headline table
pip install -r requirements.txt
python vq_stacked_sweep.py # stacked real perplexity, VQ vs scalar, per config
The 3-bit VQ win (≈25% smaller than 4-bit scalar at matched perplexity) and the "per-tensor metrics lie" result both fall out of this single script.
Study of weight quantization on public base checkpoints. Model training details are out of scope.