GLM-5.2 DSpark speculator

Overview

A DSpark speculator model for the zai-org/GLM-5.2-FP8 base model, enabling faster inference through speculative decoding. DSpark extends the DFlash parallel draft backbone with two lightweight heads: a Markov logit-bias head (low-rank intra-block token dependency) and a per-position confidence head (accept-rate prediction). Trained with the speculators library.

Model Specifications

Base Model: zai-org/GLM-5.2-FP8
Chat Template: GLM-5.2 (compatible with /chat/completions)
Format: Safetensors
License: MIT
Draft: 5 layers, block_size=8, full vocabulary (154,880), aux layers [8, 23, 39, 55, 70]
Validation Hardware: NVIDIA B300

Checkpoint series

This repo publishes per-epoch checkpoints of a single 3-epoch run. main tracks the latest available epoch; each epoch is also a permanent revision.

revision	epoch	status
`epoch-1`	1 / 3	✅ available
`epoch-2`	2 / 3	✅ this checkpoint (= `main`)
`epoch-3`	3 / 3	training

from transformers import AutoModel
model = AutoModel.from_pretrained(
    "mgoin/GLM-5.2-speculator.dspark", revision="epoch-2", trust_remote_code=True
)

Training Details

The model was trained using the Speculators library on prompts from Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and HuggingFaceH4/ultrachat_200k, with responses regenerated by GLM-5.2-FP8 itself (published as mgoin/GLM-5.2-FP8-magpie-ultrachat).

Training is online: the draft consumes hidden states streamed on-the-fly from a live GLM-5.2-FP8 vLLM server, with the trainer running FSDP data-parallel on separate GPUs. The three commands below (data prep → server → trainer) reproduce the run. Install speculators and vLLM from main. GPU indices/parallelism are examples — adjust to your hardware.

Data Preparation

python scripts/prepare_data.py \
  --model zai-org/GLM-5.2-FP8 \
  --trust-remote-code \
  --data ./regenerated_data.jsonl \
  --output ./output \
  --seq-length 8192 \
  --assistant-pattern '<\|assistant\|>((?:(?!<\|user\|>|<\|assistant\|>).)*)'

--assistant-pattern is currently needed for GLM-5.2's inline-reasoning chat format (the <think>...</think> trace is kept inside the assistant turn); it may be auto-detected by future speculators versions.

vLLM Server Launch (hidden-states server)

CUDA_VISIBLE_DEVICES=0,1,2,3 python scripts/launch_vllm.py \
  zai-org/GLM-5.2-FP8 \
  --target-layer-ids 8 23 39 55 70 \
  -- --port 8000 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --trust-remote-code

Training Command

CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun \
  --standalone \
  --nproc_per_node 4 \
  scripts/train.py \
  --verifier-name-or-path zai-org/GLM-5.2-FP8 \
  --speculator-type dspark \
  --num-layers 5 \
  --block-size 8 \
  --data-path ./output \
  --vllm-endpoint http://localhost:8000/v1 \
  --save-path ./output/checkpoints \
  --epochs 3 \
  --lr 0.0006 \
  --scheduler-type cosine \
  --total-seq-len 4096 \
  --draft-arch qwen3 \
  --draft-hidden-act silu \
  --target-layer-ids 8 23 39 55 70 \
  --max-anchors 1024 \
  --markov-rank 256 \
  --enable-confidence-head \
  --confidence-head-with-markov \
  --loss-fn '{"ce": 0.1, "tv": 0.9}' \
  --confidence-head-alpha 1.0 \
  --checkpoint-freq 0.2 \
  --on-missing generate \
  --on-generate delete \
  --seed 42 \
  --log-freq 100 \
  --prefetch-factor 2 \
  --num-workers 8 \
  --trust-remote-code

Notes:

Omitting --draft-vocab-size trains on the full vocabulary; pass --draft-vocab-size 32000 for a reduced draft vocab.
DSpark-specific flags: --markov-rank, --enable-confidence-head, --confidence-head-with-markov, --confidence-head-alpha. Dropping them (and using --speculator-type dflash) recovers a plain DFlash draft.
--checkpoint-freq 0.2 writes sub-epoch checkpoints (~5 per epoch) for resumability.

Deployment

DSpark inference support is available on the vLLM nightly (uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly):

vllm serve zai-org/GLM-5.2-FP8 \
    --tensor-parallel-size 4 \
    --max-model-len 16384 \
    --trust-remote-code \
    --speculative-config '{
        "model": "RedHatAI/GLM-5.2-speculator.dspark",
        "num_speculative_tokens": 7,
        "method": "dspark",
        "draft_sample_method": "probabilistic"
    }'

Evaluation Results

Train-set metrics at the end of epoch 2 (per-epoch validation passes did not complete due to server restarts; per-dataset acceptance evaluation will accompany the final checkpoint):

metric	epoch 1	epoch 2
mean accepted length	3.376	3.819
full accuracy	0.532	0.587
mean acceptance rate	0.497	0.557

Per-position accuracy (positions 1-7), epoch 2: 0.809 / 0.693 / 0.619 / 0.549 / 0.515 / 0.478 / 0.445

Still improving epoch-over-epoch (epoch 3 in training).

Acceptance length in vLLM (revision `epoch-2`)

Measured end-to-end in vLLM speculative decoding (nightly 0.23.1rc1.dev709+g2b753ad20), serving zai-org/GLM-5.2-FP8 on 4xB300 with num_speculative_tokens=7, draft_sample_method="probabilistic", greedy sampling, batch size 1, 64 single-turn chat prompts (32 HumanEval + 32 math_reasoning from RedHatAI/speculator_benchmarks), 1024 output tokens each. Acceptance is computed from vLLM's spec_decode_num_accepted_tokens_per_pos / num_drafts counter deltas. The earlier GLM-5.2-speculator.dspark-preview is included for reference (measured on a 16-prompt subset of the same set).

This table is updated as later epochs are published; main currently points to epoch-2.

Checkpoint	Pos 1	Pos 2	Pos 3	Pos 4	Pos 5	Pos 6	Pos 7	Accept Len	Decode tok/s
`epoch-2` (= `main`)	74.7%	56.0%	41.5%	31.1%	23.7%	18.0%	13.2%	3.58	225
`epoch-1`	74.7%	55.4%	40.2%	29.5%	21.5%	15.4%	10.9%	3.48	219
`dspark-preview`	57.5%	31.7%	15.4%	7.4%	3.3%	1.5%	0.7%	2.18	139

The epoch-2 gain comes from deeper draft positions (position-1 acceptance is unchanged), consistent with continued training. For reference, the same server without speculative decoding decodes at 102 tok/s (2.2x speedup for epoch-2).

References

DFlash: Block Diffusion for Flash Speculative Decoding (arXiv:2602.06036) — the parallel draft backbone DSpark builds on.
DSpark (DeepSeek) — the Markov + confidence-head additions replicated here.
speculators — training library.

AI assistance was used to build the training pipeline and run these experiments.

Downloads last month: -

Safetensors

Model size

4B params

Tensor type

BF16

Model tree for RedHatAI/GLM-5.2-speculator.dspark

Base model

zai-org/GLM-5.2-FP8

Finetuned

(3)

this model

Dataset used to train RedHatAI/GLM-5.2-speculator.dspark

Paper for RedHatAI/GLM-5.2-speculator.dspark

DFlash: Block Diffusion for Flash Speculative Decoding

Paper • 2602.06036 • Published Feb 5 • 89