Qwen3-0.6B Deep MoE (FineWeb recovery, step 1500)

Sparse-upcycled Qwen3-0.6B with middle layer stacking (28 → 58 layers) and linear-ramp MoE (peak 25 experts, top-2 routing).
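The two MoE ideas named above can be sketched in plain Python. This is only an illustration under stated assumptions, not the loader's actual code: the function names, the minimum expert count of 2, and the choice to ramp from the first layer to the last are all hypothetical, since this card only states a peak of 25 experts and top-2 routing.

```python
import math


def linear_ramp_experts(num_layers, min_experts, peak_experts):
    """Hypothetical schedule: expert count grows linearly with layer depth."""
    if num_layers == 1:
        return [peak_experts]
    span = peak_experts - min_experts
    return [round(min_experts + span * i / (num_layers - 1))
            for i in range(num_layers)]


def top2_route(router_logits):
    """Return two (expert_index, gate_weight) pairs for a single token."""
    # Softmax over all experts, shifted by the max for numerical stability.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Select the two experts with the highest router probability.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # Assumption: renormalize the two gate weights so they sum to 1.
    # Some implementations keep the raw softmax probabilities instead.
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_forward(x, experts, router_logits):
    """Weighted sum of the two selected experts' outputs (scalar toy input)."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))


# Toy usage: 58 layers ramping from 2 to 25 experts (the minimum is assumed).
counts = linear_ramp_experts(58, 2, 25)
toy_experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 0 * x]
routes = top2_route([0.0, 2.0, 1.0, -1.0])
output = moe_forward(1.0, toy_experts, [0.0, 2.0, 1.0, -1.0])
```

Only two experts run per token regardless of how many a layer holds, so per-token compute stays roughly constant while the parameter count grows with the expert count.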

Training

Stage 1: Reasoning/chat mix on MoE-only weights (10k steps)
Stage 2: FineWeb-Edu continued pretraining, full fine-tune (this checkpoint is step 1500 of 8000)
Optimizer: Muon + AdamW
Seq len: 512
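This card does not say how parameters are divided between the two optimizers. A common convention, assumed here and not confirmed for this model, is to give Muon the 2D hidden weight matrices and leave embeddings, the output head, norms, and biases to AdamW. The sketch below shows that split over a hypothetical name-to-shape mapping; the parameter names are illustrative only.

```python
def split_param_groups(param_shapes):
    """Partition parameter names into (muon, adamw) lists.

    Assumed rule: Muon takes 2D matrices that are not embeddings or the
    output head; everything else goes to AdamW.
    """
    muon, adamw = [], []
    for name, shape in param_shapes.items():
        is_matrix = len(shape) == 2
        is_embedding_or_head = "embed" in name or "lm_head" in name
        if is_matrix and not is_embedding_or_head:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw


# Hypothetical parameter names and shapes, for illustration only.
example_shapes = {
    "embed_tokens.weight": (1000, 64),
    "layers.0.attn.q_proj.weight": (64, 64),
    "layers.0.moe.experts.0.up_proj.weight": (128, 64),
    "layers.0.norm.weight": (64,),
    "lm_head.weight": (1000, 64),
}
muon_names, adamw_names = split_param_groups(example_shapes)
```

In a real training script the two name lists would be mapped to tensors and passed to the respective optimizer constructors.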

This checkpoint requires the custom upcycle loader in llm_from_scratch, or can be loaded via load_upcycled_qwen3() together with the moe_upcycle.json included in this repo:

from llm.model.qwen3_upcycle import load_upcycled_qwen3
model, meta = load_upcycled_qwen3("avewright/qwen3-0.6b-deep-moe-fineweb")

Format: Safetensors
Model size: 7B params
Tensor type: BF16
