Kimi-K2.7-Code Eagle3-MLA draft (32K-truncated vocab)

Eagle3-MLA speculative-decoding draft model for Kimi-K2.7-Code. The output lm_head is truncated from the full 163,840-token vocabulary to 32,000 tokens: the highest-frequency tokens in the real K2.7-Code serving-traffic token distribution, with the 256 special/template tokens force-included.
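The selection described above can be sketched as follows. This is a minimal illustration, not the script used to build this checkpoint: the function names, the toy counts, and the choice to let the forced tokens occupy slots inside the fixed budget are all assumptions.

```python
from collections import Counter

def select_draft_vocab(token_counts, draft_vocab_size, forced_ids):
    """Keep every forced id, then fill the remaining slots by descending frequency.

    Hypothetical helper: assumes forced ids count toward draft_vocab_size.
    """
    forced = set(forced_ids)
    if len(forced) > draft_vocab_size:
        raise ValueError("more forced tokens than draft vocabulary slots")
    budget = draft_vocab_size - len(forced)
    ranked = [tok for tok, _ in token_counts.most_common() if tok not in forced]
    # Sorted target ids: draft-local id i corresponds to kept[i].
    return sorted(forced | set(ranked[:budget]))

def coverage(token_counts, kept):
    """Fraction of observed token occurrences that fall inside the kept set."""
    kept_set = set(kept)
    total = sum(token_counts.values())
    covered = sum(c for tok, c in token_counts.items() if tok in kept_set)
    return covered / total

# Toy traffic histogram: token id -> number of occurrences.
counts = Counter({5: 50, 1: 30, 9: 15, 3: 4, 7: 1})
kept = select_draft_vocab(counts, draft_vocab_size=3, forced_ids=[0])
cov = coverage(counts, kept)
```

In this toy case the forced id 0 takes one slot and the two most frequent tokens take the rest, so 80 of the 100 observed occurrences are covered.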

What changed vs the full-vocab draft

lm_head.weight: [163840, 7168] -> [32000, 7168]
d2t added (draft-local id -> target global id, delta-encoded) so that vLLM can scatter the 32K draft logits back into the 163,840-token target space
embed_tokens kept at full size ([163840, 7168]), so draft input lookups are unaffected
config.json: draft_vocab_size: 32000 (was 163840)
coverage of the real draft-token distribution by the kept 32,000 tokens: 0.9927
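To illustrate the d2t mapping, here is a minimal pure-Python sketch. It assumes the delta encoding stores the offset `target_id - draft_id` for each draft-local id, so that `target_id = draft_id + d2t[draft_id]`, and that target tokens outside the draft vocabulary receive a logit of negative infinity. The helper names are hypothetical, and real implementations work on tensors rather than lists.

```python
import math

def build_d2t(kept_target_ids):
    """Delta-encode the mapping: d2t[i] = target_id - i for the i-th kept id."""
    kept = sorted(set(kept_target_ids))
    return [target_id - i for i, target_id in enumerate(kept)]

def scatter_logits(draft_logits, d2t, target_vocab_size):
    """Expand draft-vocab logits into the target vocab; uncovered ids get -inf."""
    full = [-math.inf] * target_vocab_size
    for draft_id, logit in enumerate(draft_logits):
        full[draft_id + d2t[draft_id]] = logit
    return full

# Toy example: a target vocabulary of 8 tokens, of which 4 are kept.
kept = [0, 2, 3, 7]
d2t = build_d2t(kept)
full = scatter_logits([0.5, 1.5, -0.2, 2.0], d2t, target_vocab_size=8)
```

Here `d2t` is `[0, 1, 1, 4]`, and the four draft logits land at target ids 0, 2, 3, and 7.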

Architecture

Single-layer Eagle3 decoder built on DeepSeek-V2/V3 MLA attention (Eagle3DeepseekV2ForCausalLM), hidden_size=7168. Load it in vLLM with --speculative-config '{"method":"eagle3","model":"<this repo>","num_speculative_tokens":3}'.
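When the launch command is assembled by a script, building the JSON value programmatically avoids shell-quoting mistakes. The sketch below uses only the Python standard library; the helper name is hypothetical and `<this repo>` remains a placeholder for the draft model's repository id or local path.

```python
import json
import shlex

def speculative_config_arg(draft_model, num_speculative_tokens=3):
    """Return the --speculative-config flag with a shell-safe JSON value."""
    cfg = {
        "method": "eagle3",
        "model": draft_model,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # Compact separators reproduce the single-line form shown above.
    payload = json.dumps(cfg, separators=(",", ":"))
    return "--speculative-config " + shlex.quote(payload)

arg = speculative_config_arg("<this repo>")
print(arg)
```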

Notes

The 32K truncation shrinks the lm_head GEMM by ~4x in isolation, a clear win for batch=1 and on-device decoding, where the step is memory-bandwidth-bound. In high-concurrency EP+DP serving (e.g. c=128) the end-to-end gain is small, because lm_head is not the bottleneck there. Output correctness is unaffected: the target model verifies every speculated token.
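As a rough check on the size of the saving, the arithmetic below compares the lm_head weight footprint before and after truncation, using the shapes listed above and assuming 2 bytes per BF16 weight. The raw reduction in weight bytes (and in multiply-adds) is 163,840 / 32,000 = 5.12x. The ~4x figure quoted above is presumably a measured kernel-level number, which can be lower because of fixed per-kernel overhead; that reading is an assumption, not something the model card states.

```python
HIDDEN_SIZE = 7168
FULL_VOCAB = 163_840
DRAFT_VOCAB = 32_000
BYTES_PER_BF16 = 2

def lm_head_bytes(vocab_size):
    """Bytes of lm_head weights read once per decode step at batch=1."""
    return vocab_size * HIDDEN_SIZE * BYTES_PER_BF16

full_bytes = lm_head_bytes(FULL_VOCAB)    # 2,348,810,240 bytes
draft_bytes = lm_head_bytes(DRAFT_VOCAB)  # 458,752,000 bytes
ratio = full_bytes / draft_bytes          # about 5.12

print(f"full: {full_bytes / 1e9:.2f} GB, draft: {draft_bytes / 1e9:.2f} GB, ratio: {ratio:.2f}x")
```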

Safetensors

Model size: 2B params
Tensor types: I64, BF16

Model tree for k-l-lambda/kimi-k2.7-code-eagle3-mla

Base model: moonshotai/Kimi-K2.7-Code (this model is a finetune of it)