HAGI Ablation - Model A - Dense Baseline (control)
TL;DR - dense baseline - the experimental control / floor. One arm of a four-model controlled ablation testing whether Clifford grade decomposition improves reasoning-per-parameter in a small language model. Headline result: it does not - on held-out validation the plain recurrent baseline (B) beats full GDR (D) on every seed. This card documents this arm and links the rest of the family so you can traverse the whole experiment.
One of four same-budget models in the HAGI Grade-Decomposed Recurrence (GDR)
ablation. All four share data, tokenizer, schedule, and token budget and differ in
only two flags - use_loop and use_gdr - so the comparison isolates the
mechanism. The one other difference is the ~1.3M parameters that use_gdr adds.
Research artifact, not a product. ~114M parameters, trained on 500M tokens (well under Chinchilla-optimal for this size). It writes fluent, on-topic English but its facts are unreliable. Its purpose is a controlled scientific comparison.
This model
| Role | dense baseline - the experimental control / floor |
| Recurrence (use_loop) | none |
| Clifford GDR (use_gdr) | none |
| Parameters | 113.3M |
| Checkpoint | step-00007630.pt (model + optimizer state + config dict) |
| Tokens seen | 500M (sequence length 1024) |
| Eval loss / perplexity | 3.4880 / 32.72 (shared 819,200-token set, seed 42) |
What is HAGI / GDR?
HAGI tests one hypothesis: does decomposing a recurrent transformer's hidden state into Clifford-algebra grades - scalar / vector / bivector / trivector, each with its own update rate, plus a geometric-product cross-grade interaction - improve reasoning-per-parameter in small models?
A standard recurrent-depth transformer iterates over a flat hidden vector. The hypothesis is that its gains plateau after a few iterations because every dimension converges at the same rate. GDR splits the state into grades that evolve at different rates, so each reasoning iteration has distinct dynamics.
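The idea of per-grade update rates can be illustrated with a toy sketch. Everything here is an assumption for illustration and not the HAGI implementation: the 8-component layout (1 scalar, 3 vector, 3 bivector, 1 trivector, as in a 3-dimensional Clifford algebra), the rate values, and the hypothetical `gdr_step` helper. The geometric-product cross-grade interaction is deliberately left out.

```python
# Toy illustration of grade-specific update rates (NOT the HAGI code).
# Assumed layout of one 8-component multivector: 1 + 3 + 3 + 1 components.
GRADE_SLICES = {
    "scalar": slice(0, 1),
    "vector": slice(1, 4),
    "bivector": slice(4, 7),
    "trivector": slice(7, 8),
}
# Hypothetical rates: each grade moves toward the proposal at its own speed.
RATES = {"scalar": 1.0, "vector": 0.5, "bivector": 0.25, "trivector": 0.125}


def gdr_step(state, proposal):
    """Blend each grade of `state` toward `proposal` at that grade's rate."""
    new_state = list(state)
    for grade, sl in GRADE_SLICES.items():
        rate = RATES[grade]
        new_state[sl] = [
            (1 - rate) * s + rate * p for s, p in zip(state[sl], proposal[sl])
        ]
    return new_state


state = [0.0] * 8
proposal = [1.0] * 8
for iteration in range(3):
    state = gdr_step(state, proposal)
    print(iteration, [round(v, 3) for v in state])
```

With a flat state every component would follow the same curve; here the scalar component converges in one iteration while the trivector component is still far from the proposal after three.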
The ablation (where this model fits)
| Model | Recurrence | Clifford GDR | Params |
|---|---|---|---|
| A (this model) | - | - | 113.3M |
| B | loop x3 | - | 113.3M |
| C | - | grades | 114.6M |
| D | loop x3 | grades | 114.6M |
Decisive comparison: B vs D - same backbone and compute pattern; the only
difference is grade decomposition, which adds ~1.3M parameters (113.3M vs 114.6M).
Secondary: C vs D (integrated vs bolted-on Clifford). A is the floor. Read with
the gates in docs/ABLATION.md.
Architecture (shared by all four)
| Component | Value |
|---|---|
| Hidden size | 768 |
| Layers | 12 (4 perception / 4 reasoning / 4 expression) |
| Attention | Grouped-Query Attention (12 query heads, 4 KV heads) |
| MLP | SwiGLU (768 -> 2048 -> 768) |
| Positional | RoPE (theta=10000) |
| Norm | RMSNorm (pre-norm) |
| Embeddings | weight-tied input/output |
| Vocabulary | 49,152 (SmolLM2 tokenizer) |
| Sequence length | 1024 |
| Precision | bf16 |
Models C and D add the GDR grade-update MLPs + geometric product (+~1.3M params -> 114.6M). Models B and D loop the reasoning core 3x per forward.
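The 113.3M figure for the dense backbone can be roughly reproduced from the table above. This is a back-of-envelope sketch under assumptions the card does not state: bias-free linear layers, two RMSNorm weight vectors per layer plus one final norm, and a head dimension of hidden size divided by query heads.

```python
# Back-of-envelope parameter count for the shared dense backbone (Model A).
# Assumptions (not from the card): no bias terms, two RMSNorms per layer,
# one final RMSNorm, head_dim = hidden / query_heads.
HIDDEN, LAYERS, VOCAB = 768, 12, 49_152
Q_HEADS, KV_HEADS, MLP_HIDDEN = 12, 4, 2048

head_dim = HIDDEN // Q_HEADS      # 64
kv_dim = KV_HEADS * head_dim      # 256: K and V are narrower under GQA

attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # Q and O, plus K and V
mlp = 3 * HIDDEN * MLP_HIDDEN                     # SwiGLU: gate, up, down
norms = 2 * HIDDEN                                # pre-attention, pre-MLP
per_layer = attn + mlp + norms

embeddings = VOCAB * HIDDEN       # counted once: input/output weights are tied
total_params = embeddings + LAYERS * per_layer + HIDDEN  # + final norm

print(f"{total_params:,} parameters (~{total_params / 1e6:.1f}M)")
```

Under these assumptions the estimate lands on the reported 113.3M, with the tied embedding matrix accounting for about a third of the total.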
Training (shared)
| Data | FineWeb-Edu sample-10BT |
| Tokens | 500M (~7,629 steps) |
| Optimizer | AdamW (lr 3e-4, wd 0.1, cosine decay, 400-step warmup) |
| Effective batch | 65,536 tokens/step (batch 16 x grad-accum 4 x seq 1024) |
| Hardware | Google Colab A100-40GB (bf16 + FlashAttention + torch.compile) |
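The step count and learning-rate schedule follow from the numbers in this table. The sketch below recomputes tokens per step and total steps, then shows one common form of linear warmup followed by cosine decay. The exact warmup shape and the final learning rate are not given in the card, so `lr_at` and its `min_lr=0.0` default are assumptions.

```python
import math

BATCH, GRAD_ACCUM, SEQ_LEN = 16, 4, 1024
TOKEN_BUDGET = 500_000_000

tokens_per_step = BATCH * GRAD_ACCUM * SEQ_LEN   # effective batch in tokens
steps_exact = TOKEN_BUDGET / tokens_per_step     # fractional step count
total_steps = math.ceil(steps_exact)             # steps needed to cover the budget


def lr_at(step, peak_lr=3e-4, warmup=400, total=total_steps, min_lr=0.0):
    """Linear warmup, then cosine decay to min_lr (assumed schedule shape)."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


print(tokens_per_step, round(steps_exact, 1), total_steps)
for step in (0, 399, 4000, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Rounding the fractional step count up gives 7,630, which agrees with the checkpoint filename, though whether the training loop counts steps this way is not confirmed by the card.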
Load and run (free, CPU)
git clone -b experimental https://github.com/ShmidtS/HAGI.git && cd HAGI
pip install -r requirements.txt
python scripts/generate.py --hf-repo NAME0x0/hagi-ablation-a \
--prompt "The sun is a star that" --device cpu
Inference fits in <1GB - no GPU needed. Or load the checkpoint directly:
import torch, torch.nn.functional as F
from huggingface_hub import hf_hub_download
from prototype.training.loop import load_checkpoint
from prototype.data.tokenizer import load_tokenizer
ckpt = hf_hub_download("NAME0x0/hagi-ablation-a", "step-00007630.pt")
model, step = load_checkpoint(ckpt, device="cpu"); model.eval()
tok = load_tokenizer("HuggingFaceTB/SmolLM2-135M")
x = torch.tensor([tok.encode("The sun is a star that")])
for _ in range(60):
    with torch.no_grad():
        logits = model(x)[0, -1]
    x = torch.cat([x, torch.multinomial(F.softmax(logits / 0.8, -1), 1).view(1, 1)], dim=1)
print(tok.decode(x[0].tolist()))
Model family (click to traverse)
All models below share data, tokenizer, schedule, and token budget. Stage 0 is the separate pretraining-baseline track; A/B/C/D are this controlled ablation.
| Model | Role |
|---|---|
| Stage 0 | pretraining baseline (separate track, not part of this ablation) |
| A - you are here | dense baseline - the experimental control / floor |
| B | recurrence alone (looped reasoning core, flat hidden state) |
| C | Clifford grade decomposition WITHOUT recurrence |
| D | loop + grade decomposition (the full mechanism under test) |
Results
All four were scored on the same fixed 819,200-token batch set, drawn from the training data (identical batches; lower loss is better). One run per model, training seed 42.
| Model | Params | Loss | Perplexity |
|---|---|---|---|
| A (this) | 113.3M | 3.4880 | 32.72 |
| B | 113.3M | 3.4852 | 32.63 |
| C | 114.6M | 3.4771 | 32.37 |
| D | 114.6M | 3.4702 | 32.14 |
- B vs D: loss D-B = -0.0150 (negative = grade decomposition helps).
- C vs D: loss D-C = -0.0069 (negative = integrated GDR beats bolted-on Clifford).
- Ranked from highest to lowest loss the order is A, B, C, D. On these numbers alone that matches the hypothesis: the grade decomposition carries the signal; recurrence helps mainly in its presence (loop-alone B barely moves A).
These are train-set numbers (one run per model). They did NOT hold up: on
held-out validation across 5 seeds (note below), the ranking reverses - B
beats D every time (mean D-B = +0.0175). The small train-set D-advantage does not
generalize, consistent with mild overfitting by D's extra parameters. Conclusion:
grade decomposition does not improve held-out loss at this scale. A negative
result - see docs/ABLATION.md for the gates.
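The perplexities and deltas quoted in this section can be rechecked from the loss column. This is a small sketch that assumes perplexity is the exponential of the mean per-token loss in nats, which is the usual convention. The values are copied from the results table.

```python
import math

# Losses copied from the results table (train-set batches, seed 42).
losses = {"A": 3.4880, "B": 3.4852, "C": 3.4771, "D": 3.4702}

# Perplexity as exp(mean cross-entropy loss in nats).
perplexity = {name: math.exp(loss) for name, loss in losses.items()}

delta_d_minus_b = losses["D"] - losses["B"]  # negative favours D
delta_d_minus_c = losses["D"] - losses["C"]

for name, ppl in perplexity.items():
    print(f"{name}: loss {losses[name]:.4f} -> perplexity {ppl:.2f}")
print(f"D-B = {delta_d_minus_b:+.4f}, D-C = {delta_d_minus_c:+.4f}")
```

The whole spread across the four models is under 0.02 nats, the same order of magnitude as the held-out D-B gap of +0.0175, so a reversal between train-set and held-out rankings is plausible at this resolution.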
This model's sample (prompt "The sun is a star that", temperature 0.8, seed 42):
The sun is a star that rises in the sky, then rises in the sky, eventually the sun. The two stars, the sun and the moon, are equally the same but the earth is smaller in size and has more diameter
Seed stability: across 5 seeds on held-out validation (shard_00006.bin), D beat B in 0/5 runs (mean loss D-B = +0.0175).
Geometry diagnostic (follow-up)
To locate why full GDR (D) lost to recurrence-only (B), a third arm - D_nogeo,
GDR with the geometric-product cross-grade term switched off (same 114.6M
parameters as D) - was trained head-to-head with B and D on the same held-out shard
(Kaggle T4, fp16, 1 seed: [1]).
| Comparison | Mean held-out loss delta (lower = better) |
|---|---|
| D - B (full GDR vs recurrence-only) | +0.0210 |
| D_nogeo - B (grades, no geometric product) | +0.0176 |
Removing the geometric product does not recover the gap - the grade machinery itself hurts at this scale, independent of the geometric product. GDR-as-built is falsified here; the path forward is a paper-faithful rebuild or a pivot to capability.
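The two deltas in the table can be turned into a rough attribution of the gap. This is simple arithmetic on the reported single-seed numbers, so it carries no error bars and should be read as indicative only. The variable names are illustrative.

```python
# Held-out loss deltas relative to B, copied from the diagnostic table.
d_minus_b = 0.0210        # full GDR vs recurrence-only
d_nogeo_minus_b = 0.0176  # grades without the geometric product

# Portion of the gap that disappears when the geometric product is removed.
geo_contribution = d_minus_b - d_nogeo_minus_b

# Fraction of the gap that remains with the geometric product switched off.
share_remaining = d_nogeo_minus_b / d_minus_b

print(f"geometric product accounts for {geo_contribution:.4f} of the gap")
print(f"{share_remaining:.1%} of the gap remains without it")
```

Most of the gap survives the removal of the geometric product, which is what the paragraph above means by the grade machinery itself being the cost at this scale.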
Links
- Code: https://github.com/ShmidtS/HAGI (branch experimental)
- Spec: docs/ARCHITECTURE.md, docs/MILESTONES.md, docs/ABLATION.md
- License: Apache 2.0