GLM-5.2 DSpark speculator
Overview
A DSpark speculator model for the zai-org/GLM-5.2-FP8 base model, enabling faster
inference through speculative decoding. DSpark extends the DFlash parallel draft
backbone with two lightweight heads: a Markov logit-bias head (low-rank
intra-block token dependency) and a per-position confidence head (accept-rate
prediction). Trained with the speculators
library.
Model Specifications
- Base Model: zai-org/GLM-5.2-FP8
- Chat Template: GLM-5.2 (compatible with
/chat/completions) - Format: Safetensors
- License: MIT
- Draft: 5 layers,
block_size=8, full vocabulary (154,880), aux layers[8, 23, 39, 55, 70] - Validation Hardware: NVIDIA B300
Checkpoint series
This repo publishes per-epoch checkpoints of a single 3-epoch run. main tracks
the latest available epoch; each epoch is also a permanent revision.
| revision | epoch | status |
|---|---|---|
epoch-1 |
1 / 3 | ✅ available |
epoch-2 |
2 / 3 | ✅ this checkpoint (= main) |
epoch-3 |
3 / 3 | training |
from transformers import AutoModel
model = AutoModel.from_pretrained(
"mgoin/GLM-5.2-speculator.dspark", revision="epoch-2", trust_remote_code=True
)
Training Details
The model was trained using the Speculators library on prompts from
Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and HuggingFaceH4/ultrachat_200k,
with responses regenerated by GLM-5.2-FP8 itself (published as
mgoin/GLM-5.2-FP8-magpie-ultrachat).
Training is online: the draft consumes hidden states streamed on-the-fly from a live GLM-5.2-FP8 vLLM server, with the trainer running FSDP data-parallel on separate GPUs. The three commands below (data prep → server → trainer) reproduce the run. Install speculators and vLLM from main. GPU indices/parallelism are examples — adjust to your hardware.
Data Preparation
python scripts/prepare_data.py \
--model zai-org/GLM-5.2-FP8 \
--trust-remote-code \
--data ./regenerated_data.jsonl \
--output ./output \
--seq-length 8192 \
--assistant-pattern '<\|assistant\|>((?:(?!<\|user\|>|<\|assistant\|>).)*)'
--assistant-patternis currently needed for GLM-5.2's inline-reasoning chat format (the<think>...</think>trace is kept inside the assistant turn); it may be auto-detected by future speculators versions.
vLLM Server Launch (hidden-states server)
CUDA_VISIBLE_DEVICES=0,1,2,3 python scripts/launch_vllm.py \
zai-org/GLM-5.2-FP8 \
--target-layer-ids 8 23 39 55 70 \
-- --port 8000 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--trust-remote-code
Training Command
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun \
--standalone \
--nproc_per_node 4 \
scripts/train.py \
--verifier-name-or-path zai-org/GLM-5.2-FP8 \
--speculator-type dspark \
--num-layers 5 \
--block-size 8 \
--data-path ./output \
--vllm-endpoint http://localhost:8000/v1 \
--save-path ./output/checkpoints \
--epochs 3 \
--lr 0.0006 \
--scheduler-type cosine \
--total-seq-len 4096 \
--draft-arch qwen3 \
--draft-hidden-act silu \
--target-layer-ids 8 23 39 55 70 \
--max-anchors 1024 \
--markov-rank 256 \
--enable-confidence-head \
--confidence-head-with-markov \
--loss-fn '{"ce": 0.1, "tv": 0.9}' \
--confidence-head-alpha 1.0 \
--checkpoint-freq 0.2 \
--on-missing generate \
--on-generate delete \
--seed 42 \
--log-freq 100 \
--prefetch-factor 2 \
--num-workers 8 \
--trust-remote-code
Notes:
- Omitting
--draft-vocab-sizetrains on the full vocabulary; pass--draft-vocab-size 32000for a reduced draft vocab. - DSpark-specific flags:
--markov-rank,--enable-confidence-head,--confidence-head-with-markov,--confidence-head-alpha. Dropping them (and using--speculator-type dflash) recovers a plain DFlash draft. --checkpoint-freq 0.2writes sub-epoch checkpoints (~5 per epoch) for resumability.
Deployment
DSpark inference support is available on the vLLM nightly
(uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly):
vllm serve zai-org/GLM-5.2-FP8 \
--tensor-parallel-size 4 \
--max-model-len 16384 \
--trust-remote-code \
--speculative-config '{
"model": "RedHatAI/GLM-5.2-speculator.dspark",
"num_speculative_tokens": 7,
"method": "dspark",
"draft_sample_method": "probabilistic"
}'
Evaluation Results
Train-set metrics at the end of epoch 2 (per-epoch validation passes did not complete due to server restarts; per-dataset acceptance evaluation will accompany the final checkpoint):
| metric | epoch 1 | epoch 2 |
|---|---|---|
| mean accepted length | 3.376 | 3.819 |
| full accuracy | 0.532 | 0.587 |
| mean acceptance rate | 0.497 | 0.557 |
Per-position accuracy (positions 1-7), epoch 2:
0.809 / 0.693 / 0.619 / 0.549 / 0.515 / 0.478 / 0.445
Still improving epoch-over-epoch (epoch 3 in training).
Acceptance length in vLLM (revision epoch-2)
Measured end-to-end in vLLM speculative decoding (nightly
0.23.1rc1.dev709+g2b753ad20), serving zai-org/GLM-5.2-FP8 on 4xB300 with
num_speculative_tokens=7, draft_sample_method="probabilistic", greedy
sampling, batch size 1, 64 single-turn chat prompts (32 HumanEval + 32
math_reasoning from
RedHatAI/speculator_benchmarks),
1024 output tokens each. Acceptance is computed from vLLM's
spec_decode_num_accepted_tokens_per_pos / num_drafts counter deltas.
The earlier
GLM-5.2-speculator.dspark-preview
is included for reference (measured on a 16-prompt subset of the same set).
This table is updated as later epochs are published;
maincurrently points toepoch-2.
| Checkpoint | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Accept Len | Decode tok/s |
|---|---|---|---|---|---|---|---|---|---|
epoch-2 (= main) |
74.7% | 56.0% | 41.5% | 31.1% | 23.7% | 18.0% | 13.2% | 3.58 | 225 |
epoch-1 |
74.7% | 55.4% | 40.2% | 29.5% | 21.5% | 15.4% | 10.9% | 3.48 | 219 |
dspark-preview |
57.5% | 31.7% | 15.4% | 7.4% | 3.3% | 1.5% | 0.7% | 2.18 | 139 |
The epoch-2 gain comes from deeper draft positions (position-1 acceptance is
unchanged), consistent with continued training. For reference, the same
server without speculative decoding decodes at 102 tok/s (2.2x speedup for
epoch-2).
References
- DFlash: Block Diffusion for Flash Speculative Decoding (arXiv:2602.06036) — the parallel draft backbone DSpark builds on.
- DSpark (DeepSeek) — the Markov + confidence-head additions replicated here.
- speculators — training library.
AI assistance was used to build the training pipeline and run these experiments.
- Downloads last month
- -
Model tree for RedHatAI/GLM-5.2-speculator.dspark
Base model
zai-org/GLM-5.2-FP8