Configuration Parsing Warning:In config.json: "quantization_config.bits" must be less than or equal to 8

GLM-5.2 — mixed-bit VQ (AQLM) ~1.86-bit

A ~180 GiB quantization of GLM-5.2 (744B Chinese-native reasoning MoE, MIT) that keeps Japanese/English/Chinese thinking-mode quality at ~1.86 bit/weight, using vector quantization with GPTQ error compensation (AQLM-style) instead of scalar rounding.

Runs on 2× RTX PRO 6000 (sm_120 Blackwell, ~95 GiB each) via vLLM.

Why VQ

At the same size, scalar mixed-bit rounding loses too much at 1–2 bit. Replacing the scalar codes with a shared vector codebook + per-row error compensation recovers most of it (the two together are super-additive — neither alone is enough):

Metric Scalar mixed-bit (same ~180 GiB) This (VQ + compensation)
Calibration KL (fake-quant, iso-size) baseline −47 %
Greedy arithmetic eval (JA/EN/ZH, terminate + correct) 21/22 22/22
1-bit experts collapse (KL ≈ 13) survive (KL ≈ 0.42)

The −47 % KL is a fake-quant, iso-size comparison; the deployable end-to-end signal is the greedy eval (multi-digit multiplication and word problems in all three languages, including held-out items not in calibration).

Serving

This is not a plug-and-play GGUF — it needs a matching sm_120 stack:

  • vLLM with GlmMoeDsaForCausalLM + sm_120 kernels (reference: jasl/vllm PR-41834 sm12x preview).
  • transformers 5.12.
  • The VQ serving plugin from mmzz164/OneCompression @ glm-serving-v1 — see example/glm-5.2/ for the launcher and full instructions.
  • 2× ~95 GiB sm_120 GPUs, EP=1 / TP=2 (VQ codes can't be tensor-parallel-sharded).
GLM_CKPT=/path/to/this/model bash start_glm_api_vq.sh   # OpenAI API :8001, served as "glm-5.2"

MixedVQMoEMethod is auto-selected from the format:"vq" markers in quantization_config.

Performance

  • ~16 tok/s steady-state decode (single stream), 38× over the eager dequant baseline (grouped Triton VQ-GEMM + CUDA graphs; the key win was fixing a shared-memory bank conflict in the codebook gather).
  • Context ≤ 4096 (dense MLA — sm_120 has no sparse-DSA forward kernel; dense is exact at ctx ≤ 2048 and validated functional, incl. >2048 needle retrieval, up to 4096).

Allocation

Mixed 1/2/3-bit per expert (≈ 15.8k @1-bit / 34.0k @2-bit / 7.8k @3-bit projections), allocated by activation-frequency-aware AutoBit (arithmetic-routing tuned), then re-encoded to VQ codes + a shared codebook per bit-width. Non-expert "spine" stays scalar 4-bit.

Limitations

  • sm_120-specific serving stack (not portable to plain wheels).
  • Context capped at 4096 on this hardware (sm_120 sparse-attention kernel gap + VRAM); longer context needs smaller weights.
  • Default serving template is thinking-on at max reasoning effort — thorough but verbose for casual chat (pass chat_template_kwargs={"enable_thinking": false} for direct answers).

License & attribution

  • This quantized model: MIT.
  • Base GLM-5.2: MIT, © Zhipu AI — this is a derivative; all rights/attribution to upstream.
  • Quantization/serving built on OneCompression (MIT, © Fujitsu Ltd.) and vLLM / transformers (Apache-2.0).
Downloads last month
179
Safetensors
Model size
52B params
Tensor type
I32
·
F16
·
F32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for aquaman164/GLM-5.2-VQ-Arith

Base model

zai-org/GLM-5.2
Quantized
(74)
this model