Qwen3-0.6B Deep MoE (FineWeb recovery, step 1500)
Sparse-upcycled Qwen3-0.6B with middle-layer stacking (28 → 58 layers) and a linear-ramp MoE (peak of 25 experts, top-2 routing).
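For readers unfamiliar with top-2 routing, here is a minimal pure-Python sketch of the general technique, not the code used in this repo. The names `top2_route` and `moe_forward` are hypothetical, and renormalizing the two selected gate weights so they sum to 1 is an assumption (some MoE implementations skip that step). The layer stacking and the linear ramp of expert counts are not modeled here.

```python
import math


def top2_route(router_logits):
    """Return [(expert_index, gate_weight), ...] for the two best experts."""
    if len(router_logits) < 2:
        raise ValueError("top-2 routing needs at least two experts")
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Select the two most probable experts.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # ASSUMPTION: renormalize so the two gate weights sum to 1.
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_forward(x, experts, router_logits):
    """Gate-weighted sum of the outputs of the two selected experts."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))


if __name__ == "__main__":
    # Toy layer with 25 experts (the peak count above); expert k scales x by k.
    experts = [lambda x, k=k: k * x for k in range(25)]
    logits = [0.0] * 25
    logits[3] = logits[7] = 2.0
    print(top2_route(logits))
    print(moe_forward(1.0, experts, logits))
```

Only two experts run per token, so per-token compute stays close to that of a dense layer even as the number of experts grows.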
Training
- Stage 1: Reasoning/chat mix, MoE-only weights (10k steps)
- Stage 2: FineWeb-Edu continued pretraining, full fine-tune (this checkpoint is step 1500 of 8000)
- Optimizer: Muon + AdamW
- Seq len: 512
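This card does not say how parameters were divided between Muon and AdamW. A common convention, shown here as a sketch under that assumption, gives 2D hidden weight matrices to Muon and everything else (embeddings, output head, norms, biases) to AdamW. The function name `split_for_muon` and the name-based matching are hypothetical, and the parameter names and shapes in the usage example are illustrative only.

```python
def split_for_muon(param_shapes):
    """Partition parameter names into (muon_names, adamw_names).

    param_shapes maps a parameter name to its shape tuple.
    ASSUMPTION: Muon handles 2D hidden matrices; AdamW handles the rest.
    """
    muon, adamw = [], []
    for name, shape in param_shapes.items():
        is_matrix = len(shape) == 2
        # Embeddings and the output head are 2D but are usually left to AdamW.
        is_embedding_or_head = "embed" in name or "lm_head" in name
        if is_matrix and not is_embedding_or_head:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw


if __name__ == "__main__":
    # Illustrative names and shapes, not read from this checkpoint.
    shapes = {
        "model.embed_tokens.weight": (151936, 1024),
        "model.layers.0.self_attn.q_proj.weight": (2048, 1024),
        "model.layers.0.input_layernorm.weight": (1024,),
        "lm_head.weight": (151936, 1024),
    }
    muon_names, adamw_names = split_for_muon(shapes)
    print("Muon:", muon_names)
    print("AdamW:", adamw_names)
```

Expert weights stored as stacked 3D tensors would fall into the AdamW group under this rule, so a real setup may need to reshape or special-case them.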
Loading
This checkpoint requires the custom upcycle loader from llm_from_scratch. Load it with load_upcycled_qwen3(), together with the moe_upcycle.json included in this repo.
from llm.model.qwen3_upcycle import load_upcycled_qwen3
model, meta = load_upcycled_qwen3("avewright/qwen3-0.6b-deep-moe-fineweb")