spawn-router
A compact, fast, local-first multi-head classifier for coding-agent task routing. Given a task prompt at kickoff, it predicts stable task properties; a downstream policy/config then maps those properties to a model, provider, and execution behavior. The classifier predicts the ontology; your config owns the orchestration.
This is the model component of spawn; see Afterbuild/spawn-router for the training code and spawn-gateway for the local gateway that wraps Claude Code / Codex and routes with these weights.
- Backbone: microsoft/deberta-v3-small (multi-head)
- Checkpoint: v6 (final text-only training run)
- Inference: torch-free ONNX path, ~7 ms/prompt CPU, ~140 MB deps
- Input: text only (current_text), 256-token max
What it predicts
complexity: easy | medium | hard
+ sub-dims (0..1 regression): reasoning_depth, scope_breadth,
domain_knowledge, spec_completeness (inverted: low spec = harder)
task_type: bugfix | feature | refactor | test | design | docs | migration | exploration
risk: low | medium | high
+ sub-dims (0..1 regression): security_surface, data_sensitivity,
production_exposure, reversal_cost
+ per-head confidences (post-hoc temperature-scaled) and overall_confidence
- complexity → capability tier (small / mid / large model)
- task_type → model specialty (e.g. design → Claude, systems → GPT, docs → small)
- risk → tier bumper (easy + high-risk still routes capable) and confirmation gate
The ONNX graphs emit the three classification logits plus all eight regression sub-dimension scores. Output names are:
- complexity_logits, task_type_logits, risk_logits
- complexity_sub_reasoning_depth, complexity_sub_scope_breadth, complexity_sub_domain_knowledge, complexity_sub_spec_completeness
- risk_sub_security_surface, risk_sub_data_sensitivity, risk_sub_production_exposure, risk_sub_reversal_cost
Routing is kickoff-only: classify once at task start and lock the model for the whole task cycle (no per-turn re-routing, so no context thrash).
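The mapping from predicted properties to a model lives in your own config, so the following is only a minimal sketch of what such a policy could look like. The tier names, the `route` function, the `KickoffLock` class, the 0.8 confidence cutoff, and the choice that high risk bumps to at least the mid tier are all assumptions made for illustration; none of them ship with the model.

```python
from dataclasses import dataclass

TIERS = ["small", "mid", "large"]  # hypothetical tier names
COMPLEXITY_TO_TIER = {"easy": 0, "medium": 1, "hard": 2}


@dataclass(frozen=True)
class Route:
    tier: str
    needs_confirmation: bool


def route(prediction: dict, min_confidence: float = 0.8) -> Route:
    """Map a classifier result (label + confidence per head) to a tier."""
    complexity = prediction["complexity"]
    risk = prediction["risk"]
    tier = COMPLEXITY_TO_TIER[complexity["label"]]
    # Risk as a tier bumper: high-risk work never runs on the smallest tier.
    if risk["label"] == "high":
        tier = max(tier, 1)
    # Assumption: when complexity confidence is low, round the tier up.
    if complexity["confidence"] < min_confidence:
        tier = min(tier + 1, len(TIERS) - 1)
    # Risk as a confirmation gate: high-risk work asks a human first.
    return Route(TIERS[tier], needs_confirmation=risk["label"] == "high")


class KickoffLock:
    """Classify once per task, then reuse the same route on every later turn."""

    def __init__(self, classify_fn):
        self._classify = classify_fn
        self._routes = {}

    def route_for(self, task_id: str, prompt: str) -> Route:
        if task_id not in self._routes:
            self._routes[task_id] = route(self._classify(prompt))
        return self._routes[task_id]
```

Passing the classifier in as `classify_fn` keeps the policy testable without loading the ONNX graph, and the per-task dictionary is what makes later turns skip re-routing.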
Files
| File | What |
|---|---|
| spawn_router.int8.onnx | int8-quantized graph, recommended for serving (~164 MB) |
| spawn_router.onnx | fp32 graph (~540 MB) |
| model.pt | PyTorch state dict, for fine-tuning / sub-dim outputs (~565 MB) |
| spm.model + *tokenizer*.json | SentencePiece (DeBERTa-v2/spm) tokenizer |
| model_config.json | architecture + label maps |
| temperature_scaling.json | per-head calibration temperatures |
| *_metrics.json, battery_results.json | evaluation results |
Usage (ONNX, torch-free)
Needs only onnxruntime, numpy, and sentencepiece: no torch, no transformers.
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

MODEL_DIR = "."  # dir containing spawn_router.int8.onnx, spm.model, *.json
MAX_LEN = 256
LABELS = {
    "complexity_logits": ["easy", "medium", "hard"],
    "task_type_logits": ["bugfix", "feature", "refactor", "test",
                         "design", "docs", "migration", "exploration"],
    "risk_logits": ["low", "medium", "high"],
}
TEMPS = {  # from temperature_scaling.json; output name -> head temperature
    "complexity_logits": 0.891251,
    "task_type_logits": 0.707946,
    "risk_logits": 1.059254,
}

sp = spm.SentencePieceProcessor(model_file=f"{MODEL_DIR}/spm.model")
sess = ort.InferenceSession(f"{MODEL_DIR}/spawn_router.int8.onnx",
                            providers=["CPUExecutionProvider"])


def classify(text: str) -> dict:
    # DeBERTa-v3 spm tokenizer: [CLS]=1 + pieces (truncated) + [SEP]=2
    pieces = sp.encode(f"Current: {text}", out_type=int)[: MAX_LEN - 2]
    ids = [1, *pieces, 2]
    feeds = {  # structured inputs are text-only sentinels (no interaction context)
        "input_ids": np.array([ids], dtype=np.int64),
        "attention_mask": np.ones((1, len(ids)), dtype=np.int64),
        "previous_action_id": np.array([0], dtype=np.int64),   # "none"
        "previous_outcome_id": np.array([4], dtype=np.int64),  # "unknown"
        "log_recency_seconds": np.array([0.0], dtype=np.float32),
        "has_interaction": np.array([0], dtype=np.int64),
        "has_recency": np.array([0], dtype=np.int64),
    }
    # Pair each declared output name with the corresponding returned array.
    out = {o.name: v for o, v in zip(sess.get_outputs(), sess.run(None, feeds))}
    result = {}
    for name, labels in LABELS.items():
        # Temperature-scale the logits, then apply a numerically stable softmax.
        logits = out[name][0] / TEMPS[name]
        p = np.exp(logits - logits.max())
        p /= p.sum()
        i = int(p.argmax())
        result[name.replace("_logits", "")] = {
            "label": labels[i], "confidence": round(float(p[i]), 4),
        }
    result["complexity_sub"] = {
        name.replace("complexity_sub_", ""): round(float(out[name][0]), 4)
        for name in out if name.startswith("complexity_sub_")
    }
    result["risk_sub"] = {
        name.replace("risk_sub_", ""): round(float(out[name][0]), 4)
        for name in out if name.startswith("risk_sub_")
    }
    return result


print(classify("refactor JWT key rotation in prod"))
# {'complexity': {'label': ...}, 'task_type': {'label': ...}, 'risk': {'label': ...},
#  'complexity_sub': {'reasoning_depth': ...}, 'risk_sub': {'security_surface': ...}}
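The usage code divides each head's logits by its temperature before the softmax. As a small standalone illustration of what that does to the reported confidence, the sketch below applies the task_type and risk temperatures to made-up logits; the logit values and the helper name `softmax_with_temperature` are assumptions for the example, not model output.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up logits, for illustration only
raw = softmax_with_temperature(logits)
sharpened = softmax_with_temperature(logits, 0.707946)  # task_type temperature
softened = softmax_with_temperature(logits, 1.059254)   # risk temperature

# A temperature below 1 raises the top probability, above 1 lowers it.
# The predicted label (argmax) is the same in all three cases.
print(max(sharpened), max(raw), max(softened))
```

Temperature scaling changes only the confidence values, never the predicted label, so it affects confidence-based gating but not the accuracy figures.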
Evaluation
Two complementary measures (eval scripts in Afterbuild/spawn-router: scripts/eval_battery.py, eval.py):
Locked kickoff battery (83 hand-labeled probes, never in training; the canonical cross-version benchmark):
| Metric | v6 |
|---|---|
| Unified kickoff score | 69.5% |
| Exact match (all 3 heads) | 37.4% |
| complexity | 65.1% |
| task_type | 78.3% |
| risk | 65.1% |
Held-out test split (n=174, mirrors the training distribution):
| Head | Accuracy | Macro F1 |
|---|---|---|
| complexity | 67.8% | 68.1% |
| task_type | 86.8% | 87.2% |
| risk | 66.7% | 62.2% |
| Exact match | 39.1% | n/a |
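For readers who want to reproduce these columns on their own labeled prompts, here is a minimal sketch of per-head accuracy, exact match across all three heads, and macro F1. It assumes gold and predicted labels are plain dicts keyed by head name; the function names are hypothetical and this is not the code in eval.py.

```python
HEADS = ("complexity", "task_type", "risk")


def head_accuracy(gold, pred, head):
    """Fraction of examples where one head's label matches."""
    return sum(g[head] == p[head] for g, p in zip(gold, pred)) / len(gold)


def exact_match(gold, pred):
    """Fraction of examples where all three heads match at once."""
    hits = sum(all(g[h] == p[h] for h in HEADS) for g, p in zip(gold, pred))
    return hits / len(gold)


def macro_f1(gold, pred, head):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    labels = sorted({g[head] for g in gold} | {p[head] for p in pred})
    scores = []
    for label in labels:
        tp = sum(g[head] == label and p[head] == label for g, p in zip(gold, pred))
        fp = sum(g[head] != label and p[head] == label for g, p in zip(gold, pred))
        fn = sum(g[head] == label and p[head] != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Exact match is much stricter than any single head. If head errors were independent, it would be roughly the product of the per-head accuracies, which for the held-out split is about 39.3%, close to the reported 39.1%.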
Sub-dimension regression R² (PyTorch model): reasoning_depth 0.51, scope_breadth 0.47, spec_completeness 0.34, domain_knowledge 0.30; reversal_cost 0.55, production_exposure 0.51, data_sensitivity 0.25, security_surface 0.19.
Calibration: per-head temperature scaling fit on validation. ECE on the held-out split is ~0.37 at the 0.8 automation threshold, so confidence is not yet well-calibrated for aggressive automation; gate on it conservatively.
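To check calibration on your own traffic, a common approach is equal-width binned expected calibration error (ECE). The sketch below shows that standard definition, plus one possible conservative gate. It is an assumption that this matches how the figure above was computed; the helper names, the 10-bin choice, and the rule that high-risk tasks never auto-run are illustrative only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def may_automate(prediction, threshold=0.8):
    """Conservative gate: every head must clear the threshold and risk must not be high."""
    heads = ("complexity", "task_type", "risk")
    confident = all(prediction[h]["confidence"] >= threshold for h in heads)
    return confident and prediction["risk"]["label"] != "high"
```

Here `prediction` has the same shape as the dict returned by `classify` in the usage section. Anything that fails the gate would fall back to a human confirmation step.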
Intended use
- Pick a capability tier / provider for a coding task at kickoff, before the first expensive agent call.
- Drive a confirmation gate for high-blast-radius work (risk/security/prod).
- Spread work across tiers to reduce rate-limit pressure.
Out of scope: per-turn routing; non-coding prompts; high-stakes autonomous action without a human gate; languages other than English (trained on English).
Limitations
- Cold-start ceiling. Effort/blast-radius isn't fully derivable from prompt text: complexity=medium and risk=high are the weakest bands, especially on short imperatives. Production signals (overrides, retries, session duration) are the intended path past this; this checkpoint predates that loop.
- Synthetic-label ceiling. Much training data is LLM-labeled; expect a ~75-80% ceiling per head until real disagreement signals are mixed in.
- Quantized serving tradeoff. The fp32 ONNX graph matches the PyTorch model on the locked battery, including sub-dimension scores. The int8 graph is the recommended low-dependency serving artifact and preserves the established v6 serving behavior, but dynamic quantization can move borderline labels and regression values.
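One way to see how much quantization moves results on your own prompts is to run the same inputs through both graphs and compare. The sketch below assumes you already have two lists of result dicts in the shape returned by `classify` in the usage section, one per graph; `compare_outputs` is a hypothetical helper, not part of the repo.

```python
def compare_outputs(fp32_results, int8_results,
                    heads=("complexity", "task_type", "risk")):
    """Return per-head label agreement and the largest sub-dimension shift."""
    n = len(fp32_results)
    agreement = {
        head: sum(a[head]["label"] == b[head]["label"]
                  for a, b in zip(fp32_results, int8_results)) / n
        for head in heads
    }
    max_shift = 0.0
    for a, b in zip(fp32_results, int8_results):
        for group in ("complexity_sub", "risk_sub"):
            for name, value in a[group].items():
                # Absolute change in a 0..1 regression score between graphs.
                max_shift = max(max_shift, abs(value - b[group][name]))
    return agreement, max_shift
```

Agreement below 1.0 on a head points at borderline prompts whose label flipped; those are good candidates to review before switching serving artifacts.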
Training
- Backbone microsoft/deberta-v3-small, attention pooling, head dependencies, 3 softmax heads + 8 regression heads; current_text_only feature mode.
- 5 epochs, batch 16, encoder LR 2e-5, head LR 1e-4, weight decay 0.01, warmup 0.1, seed 13; post-hoc per-head temperature scaling on validation.
- Data: v6 mixed set (train 1141 / val 174 / test 174), a mix of synthetic coding-task prompts and real coding-agent kickoff prompts. The merged training set is not distributed (it embeds third-party trace text and personal usage traces); the synthetic seed data and the full data pipeline are in the code repo. See "Training data provenance" below.
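To give a sense of scale for these hyperparameters, the sketch below works out the optimizer step budget. It assumes "warmup 0.1" is a fraction of total steps, that there is no gradient accumulation, and that the last partial batch is kept. The post-warmup schedule is not stated above, so the helper simply holds the peak rate; treat all of this as illustrative arithmetic rather than the actual training loop.

```python
import math

TRAIN_EXAMPLES = 1141
BATCH_SIZE = 16
EPOCHS = 5
WARMUP_RATIO = 0.1  # assumption: fraction of total optimizer steps
ENCODER_LR = 2e-5
HEAD_LR = 1e-4

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)  # 72, keeping the partial batch
total_steps = steps_per_epoch * EPOCHS                    # 360
warmup_steps = round(total_steps * WARMUP_RATIO)          # 36


def warmup_lr(step, peak_lr):
    """Linear warmup to peak_lr, then flat (the real decay schedule is not stated)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# Encoder and heads warm up together but toward different peak rates.
for step in (0, 17, 35, 100):
    print(step, warmup_lr(step, ENCODER_LR), warmup_lr(step, HEAD_LR))
```

Under these assumptions the whole run is only a few hundred optimizer steps.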
Training data provenance
Disclosed in full so downstream users can do their own diligence:
- Synthetic coding-task prompts (majority of the mix): written by Claude sub-agents and hand-labeled; included in the code repo.
- SWE-bench problem statements: used only as Claude-paraphrased short-imperative prompts (no code, patches, or full issue text). The SWE-bench benchmark code is MIT; the aggregated issue text is owned by its authors and the HF dataset card carries no license tag.
- Public coding-agent trace datasets (badlogicgames/pi-mono, armand0e/gpt-5.5-agent, lewtun/ml-intern-sessions): kickoff prompts extracted and labeled. These carry license: other or no license; their raw text is not redistributed here.
- The author's own local agent traces: first-task prompts only; not redistributed.
- Labels and paraphrases were produced with Anthropic Claude; per Anthropic's Commercial Terms, outputs are customer-owned. No other provider's models were used for generation or labeling.
The model is a non-generative classifier (three softmax heads over 256-token inputs); it emits logits, not text, and cannot reproduce training data.
Credits & provenance
- Scaffolding began as a fork of tiny-router by Udara Jay (MIT); the ontology, data, heads, and serving path were rebuilt for coding-agent routing.
- Backbone: DeBERTa-v3 (He et al.; microsoft/deberta-v3-small, MIT).
- Related prior art: Vercel v0 Auto and NVIDIA's prompt-task-and-complexity-classifier.
- License: Apache-2.0 (weights), with the training-data provenance disclosed above; the training/serving code repo is MIT.