COMPLEXITY-DEEP Token-Routed MoE (383.5M) - Training Checkpoint (Step 15,259)

Resumable training checkpoint with full optimizer state, saved at the end of training on 8B tokens.

Note: This model was trained with a Chinchilla-like token budget (8B tokens for 383.5M parameters, ~21 tokens/param). The model may benefit from continued training beyond this point.
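
The budget figures in the note can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the rounded numbers quoted in this card; the constant names are illustrative. The tokens-per-step value comes out at roughly 524,000, which is close to 2^19 = 524,288, but the actual global batch size is not stated here, so treat that as an observation rather than a documented setting.

```python
# Rounded figures quoted in this card.
TOTAL_TOKENS = 8_000_000_000   # "8B tokens"
TOTAL_PARAMS = 383_500_000     # "383.5M" parameters
TOTAL_STEPS = 15_259           # optimizer steps at this checkpoint

# Token budget relative to model size (the card quotes ~21 tokens/param).
tokens_per_param = TOTAL_TOKENS / TOTAL_PARAMS

# Average tokens consumed per optimizer step (derived, not stated in the card).
tokens_per_step = TOTAL_TOKENS / TOTAL_STEPS

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"tokens per step:      {tokens_per_step:,.0f}")
```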

Contents

  • checkpoint.pt - Model weights + training state
  • model.safetensors - Model weights (safetensors format)
  • optimizer_rank0.pt - AdamW optimizer state (GPU 0)
  • optimizer_rank1.pt - AdamW optimizer state (GPU 1)
  • training_state.json - Step counter, LR, etc.
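
Before resuming, it may be worth confirming that a downloaded copy contains every file listed above. The helper below is a hypothetical sketch (the function name `missing_checkpoint_files` is not part of this repository); it only checks for the five file names from the list and does not validate their contents.

```python
from pathlib import Path

# File names taken from the Contents list of this card.
EXPECTED_FILES = (
    "checkpoint.pt",
    "model.safetensors",
    "optimizer_rank0.pt",
    "optimizer_rank1.pt",
    "training_state.json",
)


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all five files are present; for example, `missing_checkpoint_files(".")` can be called from inside the checkpoint directory.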

Model Config

  • Parameters: 383.5M total, ~105M active per token
  • Hidden: 1024, Layers: 20, Heads: 16, KV Heads: 4
  • Experts: 4, Intermediate: 3200 (800/expert), Shared: 800
  • Training: 8B tokens (15,259 steps), AdamW lr=2.1e-4, cosine schedule with 5% warmup
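
A few per-layer dimensions follow from the numbers above. The sketch below collects them in a hypothetical `ModelConfig` dataclass; the class and field names are illustrative and are not the repository's actual config. It assumes the usual convention that the head dimension is hidden size divided by the number of heads, and it reads "Shared: 800" as the width of a shared expert.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Values copied from the Model Config list; field names are hypothetical.
    hidden_size: int = 1024
    num_layers: int = 20
    num_heads: int = 16
    num_kv_heads: int = 4
    num_experts: int = 4
    intermediate_size: int = 3200
    shared_size: int = 800  # assumed to be the shared expert width

    @property
    def head_dim(self) -> int:
        # Assumes head_dim = hidden_size / num_heads (not stated in the card).
        return self.hidden_size // self.num_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Grouped-query attention: query heads sharing each KV head.
        return self.num_heads // self.num_kv_heads

    @property
    def expert_width(self) -> int:
        # Intermediate width split evenly across experts (card: 800/expert).
        return self.intermediate_size // self.num_experts


cfg = ModelConfig()
print(cfg.head_dim, cfg.queries_per_kv_head, cfg.expert_width)
```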

Resume Training

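import torch

# `model` and `optimizer` are not defined here: build them first with the same
# architecture and AdamW settings that were used for training.
# On PyTorch >= 2.6, torch.load defaults to weights_only=True; if loading fails,
# pass weights_only=False only for files you trust.
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])

# Load the optimizer state for this process's GPU rank (0 or 1).
# torch.distributed must already be initialized (e.g. when launched with torchrun).
rank = torch.distributed.get_rank()
optimizer_state = torch.load(f"optimizer_rank{rank}.pt", map_location="cpu")
optimizer.load_state_dict(optimizer_state)

# Resume from step 15,259; the step counter and LR are stored in training_state.json.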
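
When resuming, it helps to know where step 15,259 sits on the learning-rate schedule. The function below is a sketch of a cosine schedule with 5% warmup using the values from the Model Config section. Two details are assumptions not stated in this card: the warmup is linear, and the schedule decays to a minimum LR of 0. The name `lr_at_step` is illustrative.

```python
import math


def lr_at_step(step, total_steps=15_259, peak_lr=2.1e-4,
               warmup_frac=0.05, min_lr=0.0):
    """Cosine schedule with linear warmup (warmup shape and min_lr are assumed)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to 1.0 past the end.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 762, 8_000, 15_259):
    print(s, lr_at_step(s))
```

Under these assumptions the schedule has fully decayed by step 15,259, so continued training would likely need a new or extended schedule rather than a direct continuation of the old one.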

Pretrained Weights (inference)

For inference, use the safetensors checkpoint in ../final/ instead.
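
To see what a safetensors file contains before loading it, the header can be read with the standard library alone. The sketch below relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function name `read_safetensors_header` is hypothetical, and the path is left as an argument because the file name inside ../final/ is not given here.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} without loading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Optional free-form metadata entry; not a tensor.
    header.pop("__metadata__", None)
    return header
```

Each entry reports the tensor's dtype and shape, which is a quick way to confirm that a file matches the configuration listed above before building the model.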

License

CC-BY-NC-4.0

Complexity-ML -- 2026
