Qwen3-4B DFlash Training Journey: M1 → M1.5 → M1.6 → M1.9

What we trained, on what data, with which recipe, and how the acceptance score moved at each step.

All scores are pooled acceptance τ = E[n] + 1, where n is the number of draft tokens accepted in a verify cycle and the expectation is pooled over all verify cycles. Every run is measured on the same harness (scripts/eval_draft_8gpu_qwen3.sh: 8 SGLang servers, triton backend, target Qwen/Qwen3-4B, thinking-OFF, T=0, FULL prompts, B16). Higher τ means more tokens accepted per verify cycle (ceiling = B + 1 = 17).
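As a minimal sketch of what "pooled" means here, the snippet below contrasts pooling over all verify cycles with averaging per prompt first. The input layout (a list of per-cycle accepted counts for each prompt) and both function names are assumptions for illustration; the harness script's real log format is not shown in this card.

```python
def pooled_tau(accepted_per_cycle_by_prompt):
    """Pooled tau = E[n] + 1, with E[n] taken over every verify cycle at once."""
    all_cycles = [n for cycles in accepted_per_cycle_by_prompt for n in cycles]
    return sum(all_cycles) / len(all_cycles) + 1


def per_prompt_mean_tau(accepted_per_cycle_by_prompt):
    """Alternative for contrast: compute tau per prompt, then average the prompts."""
    taus = [sum(c) / len(c) + 1 for c in accepted_per_cycle_by_prompt]
    return sum(taus) / len(taus)


# Hypothetical accepted-draft-token counts: prompt A ran 4 cycles, prompt B ran 1.
toy = [[2, 4, 6, 8], [16]]
print("pooled tau:", pooled_tau(toy))
print("per-prompt mean tau:", per_prompt_mean_tau(toy))
```

Pooling weights each verify cycle equally, so long generations (many cycles) count for more than short ones; the per-prompt mean would weight each prompt equally instead.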

The reference bar is z-lab M0 = the published z-lab/Qwen3-4B-DFlash-b16 draft measured on our harness (so it is apples-to-apples; the only thing that differs from our runs is the draft weights).


TL;DR: score progression (pooled τ)

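| run | gsm8k | math500 | humaneval | mbpp | mt-bench | avg | Δ avg (vs previous row) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M1 (baseline) | 5.86 | 6.60 | 5.02 | 4.69 | 2.76 | 4.99 | n/a |
| M1.5 | 6.10 | 6.70 | 5.28 | 4.90 | 2.81 | 5.16 | +0.17 |
| M1.6 ep2 (best) | 6.19 | 6.90 | 5.87 | 5.20 | 3.00 | 5.43 | +0.27 |
| M1.9 ep3 (best) | 6.27 | 7.18 | 5.65 | 5.31 | 2.83 | 5.45 | +0.02 |
| z-lab M0 (target bar) | 6.27 | 8.24 | 6.72 | 5.61 | 3.43 | 6.05 | +0.60 ahead |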

Net: +0.46 avg from M1 → M1.9 (4.99 → 5.45), closing ~43% of the original 1.07-point gap to z-lab (gap computed from unrounded averages). gsm8k is now fully closed (6.27 = 6.27). For the M1.9 ep3 checkpoint, the remaining gap is concentrated on math500 (−1.06) and humaneval (−1.07).
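The net figures can be re-derived from the per-benchmark scores in the progression table. This is only a sanity-check sketch; the variable names are ours and the numbers are copied from the table.

```python
BENCHMARKS = ["gsm8k", "math500", "humaneval", "mbpp", "mt-bench"]

# Per-benchmark pooled tau, copied from the score progression table.
scores = {
    "M1": [5.86, 6.60, 5.02, 4.69, 2.76],
    "M1.9 ep3": [6.27, 7.18, 5.65, 5.31, 2.83],
    "z-lab M0": [6.27, 8.24, 6.72, 5.61, 3.43],
}


def avg(run):
    """Unweighted mean over the five benchmarks."""
    return sum(scores[run]) / len(scores[run])


gain = avg("M1.9 ep3") - avg("M1")           # improvement from M1 to M1.9
original_gap = avg("z-lab M0") - avg("M1")   # gap to z-lab at M1
fraction_closed = gain / original_gap

# Remaining per-benchmark gap of M1.9 ep3 against z-lab M0 (negative = behind).
per_benchmark_gap = {
    name: round(ours - ref, 2)
    for name, ours, ref in zip(BENCHMARKS, scores["M1.9 ep3"], scores["z-lab M0"])
}

print(f"gain {gain:.2f}, original gap {original_gap:.2f}, closed {fraction_closed:.0%}")
print(per_benchmark_gap)
```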


Stage detail

M1: baseline (fresh, paper-faithful)

| field | value |
| --- | --- |
| Init | from scratch (fresh draft) |
| Data | 888,784 samples: the v10-balanced Gemma prompt set (891,503 prompts), a broad mix (NuminaMath, OpenMathReasoning, OpenMathInstruct-2, Nemotron-v2 math/code, open_code_instruct, evol_codealpaca, chat) |
| Labels | regenerated with Qwen/Qwen3-4B, greedy T=0, thinking-OFF, seqlen cap 3072 |
| Loss | soft-label KD (pure forward-KL over full target vocab, α=0), decay-γ=7 |
| Schedule | 6 epochs, AdamW, lr 6e-4, cosine, warmup 0.04, grad-clip 1.0 |
| Batch | GBS 64 (8/device × 8 GPU), 512 blocks/seq, ~83K steps total |
| Eval | avg 4.99 |
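A rough sketch of the Loss row, in plain Python for readability. The card only states "forward-KL over full target vocab" and "decay-γ=7"; the exponential form of the position weights, the normalization by the weight sum, and all function names below are assumptions, not the training code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(target_probs, draft_probs, eps=1e-12):
    """KL(target || draft), summed over the full vocabulary at one position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(target_probs, draft_probs)
    )


def decay_weights(block_size, gamma=7.0):
    """ASSUMED form: w_k = exp(-k / gamma), k = 0 for the first block position."""
    return [math.exp(-k / gamma) for k in range(block_size)]


def block_kd_loss(target_logits, draft_logits, gamma=7.0):
    """Position-weighted forward-KL over one block (no hard-label term)."""
    weights = decay_weights(len(target_logits), gamma)
    kls = [
        forward_kl(softmax(t), softmax(d))
        for t, d in zip(target_logits, draft_logits)
    ]
    return sum(w * k for w, k in zip(weights, kls)) / sum(weights)


# Toy block: 3 positions, vocabulary of 4 tokens (hypothetical logits).
target = [[2.0, 0.5, 0.0, -1.0], [0.0, 3.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
draft = [[1.5, 0.5, 0.2, -1.0], [0.0, 1.0, 0.5, 0.0], [2.0, 1.0, 0.0, 1.0]]
print("weighted KD loss:", block_kd_loss(target, draft))
```

Under this assumed form, earlier block positions dominate the loss, which fits the intuition that a rejection early in the block discards everything after it.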

This already matched z-lab's scale (≈800K), epoch count (6), and most hyperparameters, yet landed 1.07 below z-lab. The gap shape (small on gsm8k, large on code/math) points at data composition, not scale: the v10 mix is a Gemma-era curation, not z-lab's clean Nemotron-PTD-v2 + CodeAlpaca sources. Every later run warm-starts off M1, so all inherit this base composition.

M1.5: first continuation (unseen leftover, switch to EAL)

| field | value |
| --- | --- |
| Init | warm-start from M1 (qwen3-4b-m1-kd-b16-g7-rope1e6/final_checkpoint.pt) |
| Data | 374,110 samples: math 149,672 / code 149,803 / chat 74,617 |
| Sources | prompts M1 never saw (hash-subtracted, eval-decontam'd): math = OpenMathInstruct-2 + Nemotron-v2 math + OpenMathReasoning; code = open_code_instruct + evol_codealpaca + ~30K function-completion (MBPP-train 374 + CodeSearchNet, HumanEval-shaped); chat = nem_ifc_chat (mt-bench anchor) |
| Loss | EAL (Expected-Acceptance-Length; uniform KL position weight) |
| Schedule | 3 epochs (ckpts step5846 / 11692 / final) |
| Eval | avg 5.16 (+0.17 vs M1) |
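The step counts in the Batch and Schedule rows line up with one checkpoint per epoch at GBS 64. The check below assumes the last partial batch of each epoch is kept (ceiling division) and that M1.5 reuses M1's GBS; neither is stated explicitly above.

```python
import math

GBS = 64  # global batch size from the M1 table; ASSUMED unchanged for M1.5


def steps_per_epoch(num_samples, gbs=GBS):
    """Optimizer steps per epoch, assuming the final partial batch is not dropped."""
    return math.ceil(num_samples / gbs)


# (sample count, epochs) copied from the stage tables.
stages = {"M1": (888_784, 6), "M1.5": (374_110, 3)}

for name, (num_samples, epochs) in stages.items():
    per_epoch = steps_per_epoch(num_samples)
    print(f"{name}: {per_epoch} steps/epoch, {per_epoch * epochs} steps total")
```

For M1.5 this gives 5,846 steps per epoch, matching the step5846 and step11692 checkpoint names; for M1 it gives roughly 83K steps over 6 epochs.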

First switch from KD → EAL loss, plus fresh unseen data. Every benchmark rose slightly; humaneval and mbpp still lagged, which motivated M1.6's code focus.
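"Hash-subtracted, eval-decontam'd" can be pictured as a set difference on prompt hashes. The sketch below is illustrative only: the normalization (lowercase, collapsed whitespace), the choice of SHA-256, and the function names are assumptions, and exact-match hashing is the simplest possible form of decontamination.

```python
import hashlib


def prompt_key(prompt):
    """Hash of a lightly normalized prompt (ASSUMED normalization)."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def subtract_seen(candidates, seen_prompts, eval_prompts):
    """Keep candidates that are new: not trained on, not in an eval set, not repeated."""
    blocked = {prompt_key(p) for p in seen_prompts}
    blocked |= {prompt_key(p) for p in eval_prompts}
    kept, kept_keys = [], set()
    for prompt in candidates:
        key = prompt_key(prompt)
        if key in blocked or key in kept_keys:
            continue
        kept.append(prompt)
        kept_keys.add(key)
    return kept


# Hypothetical toy prompts.
seen = ["Solve x + 2 = 5."]
evals = ["Write a function that reverses a string."]
pool = [
    "solve  x + 2 = 5.",                         # seen in an earlier stage
    "Write a function that reverses a string.",  # eval prompt
    "Prove that 7 is prime.",
    "Prove that 7 is prime.",                    # duplicate within the pool
]
print(subtract_seen(pool, seen, evals))
```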

M1.6: code-completion + survival-EAL (biggest single jump)

| field | value |
| --- | --- |
| Init | warm-start from M1.5 |
| Data | 224,024 samples: math 99,254 / code 99,921 / chat 24,840 |
| Sources | math = NuminaMath-CoT HARD 100K (olympiads/AMC/AoPS + the math Hendrycks slice; decontam'd vs math500); code = function-completion 120K → ~100K (jinaai/code_exercises + bigcode/self-oss-instruct, exact HumanEval/MBPP shape); chat retention |
| Loss | survival-EAL (NEW: KL position weights = survival-based, so deeper block positions are weighted by their probability of still being "alive") |
| Schedule | 3 epochs (ckpts step3501 / 7002 / final). Best = ep2 (ep3 over-trains, esp. humaneval) |
| Eval | avg 5.43 (+0.27 vs M1.5), the largest step |

Two levers together: (1) function-completion code (matching HumanEval/MBPP format) → humaneval 5.28 → 5.87, mbpp 4.90 → 5.20; (2) survival-EAL → uniform lift everywhere. qwen3_m16_ep2 is the promoted checkpoint and the warm-start base for M1.7/M1.8/M1.9.

M1.9: math-spine (unique hard-math volume)

| field | value |
| --- | --- |
| Init | warm-start from M1.6 ep2 (checkpoint_step7002.pt) |
| Data | 223,846 samples: math 163,846 / code 30,000 / chat 30,000 |
| Sources | math = the 143K unused NuminaMath-CoT HARD (never trained in M1/M1.5/M1.6) + 22K synthetic_math; code/chat = anchors reused from M1.8 (instruction-style) |
| Labels | Qwen3-4B greedy thinking-OFF, max_new_tokens 3072 (~22% length-capped) |
| Loss | survival-EAL (same as M1.6) |
| Schedule | 3 epochs (ckpts step3498 / 6996 / final). Best math500 = ep3 |
| Eval | ep3 5.45 avg; math500 7.18 (new best); gsm8k 6.27 (= z-lab) |

The diagnosis driving M1.9: a label audit confirmed our math labels already match z-lab's verbose long-LaTeX style (mean ~1,716 tok, ~85 LaTeX markers, 78.8% \boxed), so the math500 gap comes from coverage of unique hard math, not label quality. Result: math500 rose monotonically, 6.90 → 7.18, and gsm8k closed. Thinning the anchors caused two small regressions: mt-bench 3.00 → 2.83 (chat anchor only 30K) and humaneval 5.87 → 5.65 (instruction-style code instead of M1.6's function-completion).
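A label audit of the kind described above can be approximated in a few lines. This sketch is not the audit we ran: whitespace splitting stands in for the real tokenizer, and the regex that defines a "LaTeX marker" (any backslash command or dollar sign) is an assumption.

```python
import re

# ASSUMED definition of a "LaTeX marker": a backslash command or a dollar sign.
LATEX_MARKER = re.compile(r"\\[A-Za-z]+|\$")


def audit_labels(labels):
    """Summarize label style: mean length, mean LaTeX markers, share with \\boxed."""
    count = len(labels)
    lengths = [len(text.split()) for text in labels]  # whitespace tokens as a stand-in
    markers = [len(LATEX_MARKER.findall(text)) for text in labels]
    boxed = [r"\boxed" in text for text in labels]
    return {
        "mean_tokens": sum(lengths) / count,
        "mean_latex_markers": sum(markers) / count,
        "boxed_fraction": sum(boxed) / count,
    }


# Hypothetical toy labels.
toy_labels = [
    r"The answer is $\boxed{4}$.",
    "No math here.",
]
report = audit_labels(toy_labels)
print(report)
```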

Side experiments off M1.6 ep2 (not on the main line): M1.7 (frequency-capped prompt repetition) and M1.8 (raw nem_v2 diversity expansion) each moved math500 only ~+0.14 (to ~7.04) and did not beat M1.6's avg. The diminishing returns across repetition → diversity → volume (each ~+0.14) are the signature of corpus saturation on this NuminaMath-based math source.


How the loss function evolved

| stage | loss | idea |
| --- | --- | --- |
| M1 | soft-label KD | match the target's full per-position distribution (forward-KL) |
| M1.5 | EAL | directly maximize Expected Acceptance Length (reward-shaped); uniform KL weighting across block positions |
| M1.6 → M1.9 | survival-EAL | EAL + KL position weights scaled by each position's survival probability (chance the block is still accepted at that depth), which focuses capacity on positions that actually get verified |

EAL is implemented as a loss equal to the negated acceptance reward (a sign flip), so minimizing the loss maximizes expected acceptance length. Survival weighting was the M1.6 refinement that produced the largest single avg jump.
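To make the survival idea concrete, here is a toy sketch. The card does not give the exact formulas, so everything below is an assumption: per-position acceptance probabilities are taken as given inputs, a position's survival weight is the probability that all earlier positions were accepted, expected acceptance length is the sum of cumulative products, and `kl_coef` is a hypothetical mixing coefficient.

```python
def survival_weights(accept_probs):
    """Weight for position k = probability that every earlier position was accepted."""
    weights, alive = [], 1.0
    for p in accept_probs:
        weights.append(alive)
        alive *= p
    return weights


def expected_acceptance_length(accept_probs):
    """E[n] = sum over k of P(positions 1..k all accepted)."""
    total, alive = 0.0, 1.0
    for p in accept_probs:
        alive *= p
        total += alive
    return total


def eal_loss(accept_probs, kl_per_position, survival=True, kl_coef=1.0):
    """Negated acceptance reward plus a position-weighted KL term (ASSUMED combination)."""
    if survival:
        weights = survival_weights(accept_probs)   # survival-EAL (M1.6 onward)
    else:
        weights = [1.0] * len(accept_probs)        # uniform weighting (M1.5-style EAL)
    kl_term = sum(w * kl for w, kl in zip(weights, kl_per_position))
    return -expected_acceptance_length(accept_probs) + kl_coef * kl_term


# Hypothetical block of 3 positions, each accepted with probability 0.5.
probs = [0.5, 0.5, 0.5]
kls = [0.2, 0.2, 0.2]
print("survival weights:", survival_weights(probs))
print("E[n]:", expected_acceptance_length(probs))
print("uniform EAL loss:", eal_loss(probs, kls, survival=False))
print("survival-EAL loss:", eal_loss(probs, kls, survival=True))
```

In this toy setup the deepest position receives a quarter of the KL weight of the first one, since it is only reached when both earlier drafts are accepted.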


Where the remaining gap is (vs z-lab M0 = 6.05 avg)

| benchmark | best of ours | z-lab M0 | gap |
| --- | --- | --- | --- |
| gsm8k | 6.27 | 6.27 | 0.00 ✅ |
| math500 | 7.18 | 8.24 | −1.06 |
| humaneval | 5.87 | 6.72 | −0.85 |
| mbpp | 5.35 | 5.61 | −0.26 |
| mt-bench | 3.00 | 3.43 | −0.43 |

Key conclusion: scale and epochs were already matched at M1 (888K / 6 epochs), so the residual gap is not a volume problem. It traces to data composition: M1 (and thus all warm-started descendants) was built on the v10-balanced mix, not z-lab's exact Nemotron-PTD-v2 + evol-codealpaca sources. The cleanest untested experiment is a fresh full-scale run on z-lab's exact source composition, not more fine-tuning on the inherited v10 base.

Last updated: 2026-06-14.
