Configuration Parsing Warning:In config.json: "quantization_config.bits" must be less than or equal to 8
GLM-5.2 — mixed-bit VQ (AQLM) ~1.86-bit
A ~180 GiB quantization of GLM-5.2 (744B Chinese-native reasoning MoE, MIT) that keeps Japanese/English/Chinese thinking-mode quality at ~1.86 bit/weight, using vector quantization with GPTQ error compensation (AQLM-style) instead of scalar rounding.
Runs on 2× RTX PRO 6000 (sm_120 Blackwell, ~95 GiB each) via vLLM.
Why VQ
At the same size, scalar mixed-bit rounding loses too much at 1–2 bit. Replacing the scalar codes with a shared vector codebook + per-row error compensation recovers most of it (the two together are super-additive — neither alone is enough):
| Metric | Scalar mixed-bit (same ~180 GiB) | This (VQ + compensation) |
|---|---|---|
| Calibration KL (fake-quant, iso-size) | baseline | −47 % |
| Greedy arithmetic eval (JA/EN/ZH, terminate + correct) | 21/22 | 22/22 |
| 1-bit experts | collapse (KL ≈ 13) | survive (KL ≈ 0.42) |
The −47 % KL is a fake-quant, iso-size comparison; the deployable end-to-end signal is the greedy eval (multi-digit multiplication and word problems in all three languages, including held-out items not in calibration).
Serving
This is not a plug-and-play GGUF — it needs a matching sm_120 stack:
- vLLM with
GlmMoeDsaForCausalLM+ sm_120 kernels (reference:jasl/vllmPR-41834 sm12x preview). - transformers 5.12.
- The VQ serving plugin from mmzz164/OneCompression @
glm-serving-v1— seeexample/glm-5.2/for the launcher and full instructions. - 2× ~95 GiB sm_120 GPUs, EP=1 / TP=2 (VQ codes can't be tensor-parallel-sharded).
GLM_CKPT=/path/to/this/model bash start_glm_api_vq.sh # OpenAI API :8001, served as "glm-5.2"
MixedVQMoEMethod is auto-selected from the format:"vq" markers in quantization_config.
Performance
- ~16 tok/s steady-state decode (single stream), 38× over the eager dequant baseline (grouped Triton VQ-GEMM + CUDA graphs; the key win was fixing a shared-memory bank conflict in the codebook gather).
- Context ≤ 4096 (dense MLA — sm_120 has no sparse-DSA forward kernel; dense is exact at ctx ≤ 2048 and validated functional, incl. >2048 needle retrieval, up to 4096).
Allocation
Mixed 1/2/3-bit per expert (≈ 15.8k @1-bit / 34.0k @2-bit / 7.8k @3-bit projections), allocated by activation-frequency-aware AutoBit (arithmetic-routing tuned), then re-encoded to VQ codes + a shared codebook per bit-width. Non-expert "spine" stays scalar 4-bit.
Limitations
- sm_120-specific serving stack (not portable to plain wheels).
- Context capped at 4096 on this hardware (sm_120 sparse-attention kernel gap + VRAM); longer context needs smaller weights.
- Default serving template is thinking-on at max reasoning effort — thorough but verbose
for casual chat (pass
chat_template_kwargs={"enable_thinking": false}for direct answers).
License & attribution
- This quantized model: MIT.
- Base GLM-5.2: MIT, © Zhipu AI — this is a derivative; all rights/attribution to upstream.
- Quantization/serving built on OneCompression (MIT, © Fujitsu Ltd.) and vLLM / transformers (Apache-2.0).
- Downloads last month
- 179
Model tree for aquaman164/GLM-5.2-VQ-Arith
Base model
zai-org/GLM-5.2