HAGI Ablation - Model A - Dense Baseline (control)

TL;DR - dense baseline - the experimental control / floor. One arm of a four-model controlled ablation testing whether Clifford grade decomposition improves reasoning-per-parameter in a small language model. Headline result: it does not - on held-out validation the plain recurrent baseline (B) beats full GDR (D) on every seed. This card documents this arm and links the rest of the family so you can traverse the whole experiment.

One of four same-budget models in the HAGI Grade-Decomposed Recurrence (GDR) ablation. All four share data, tokenizer, schedule, and token budget and differ in only two flags - use_loop and use_gdr - so the comparison isolates the mechanism. The one residual difference is the ~1.3M parameters that GDR adds to C and D.

Research artifact, not a product. ~113M parameters, trained on 500M tokens (about 4.4 tokens per parameter, well under the roughly 20 that Chinchilla scaling suggests for this size). It writes fluent, on-topic English, but its facts are unreliable. Its purpose is a controlled scientific comparison.

This model


Property	Value
Role	dense baseline - the experimental control / floor
Recurrence (`use_loop`)	none
Clifford GDR (`use_gdr`)	none
Parameters	113.3M
Checkpoint	`step-00007630.pt` (model + optimizer state + config dict)
Tokens seen	500M (sequence length 1024)
Loss / perplexity	3.4880 / 32.72 (shared 819,200-token fixed batch set, seed 42; train-set, not held-out)

What is HAGI / GDR?

HAGI tests one hypothesis: does decomposing a recurrent transformer's hidden state into Clifford-algebra grades - scalar / vector / bivector / trivector, each with its own update rate, plus a geometric-product cross-grade interaction - improve reasoning-per-parameter in small models?

A standard recurrent-depth transformer iterates over a flat hidden vector and gains plateau after a few iterations because every dimension converges at the same rate. GDR splits the state into grades that evolve at different rates, so each reasoning iteration has distinct dynamics.
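The contrast between one shared convergence rate and per-grade rates can be illustrated with a toy update. This is only a sketch, not the HAGI implementation: each grade is reduced to a single number, the rates in `RATES` are hypothetical values chosen for illustration, and the geometric-product cross-grade term is left out entirely.

```python
# Toy illustration only, NOT the HAGI/GDR code.
GRADES = ("scalar", "vector", "bivector", "trivector")

# Hypothetical per-grade update rates; the card does not state the real values.
RATES = {"scalar": 0.9, "vector": 0.6, "bivector": 0.3, "trivector": 0.1}


def iterate(state, target, rates, n_iters):
    """Move every grade toward its target by that grade's own rate, n_iters times."""
    for _ in range(n_iters):
        state = {g: state[g] + rates[g] * (target[g] - state[g]) for g in state}
    return state


start = {g: 0.0 for g in GRADES}
goal = {g: 1.0 for g in GRADES}

# Flat hidden state: every component shares one rate, so all converge together.
flat = iterate(start, goal, {g: 0.5 for g in GRADES}, n_iters=3)
# Grade-decomposed state: each grade closes its gap at a different speed.
graded = iterate(start, goal, RATES, n_iters=3)

remaining_flat = {g: 1.0 - flat[g] for g in GRADES}
remaining_graded = {g: 1.0 - graded[g] for g in GRADES}

for g in GRADES:
    print(f"{g:9s} flat gap {remaining_flat[g]:.3f}   graded gap {remaining_graded[g]:.3f}")
```

After three iterations (matching the loop x3 setting) the flat state has the same remaining gap in every component, while the graded state still has fast and slow components, which is the "distinct dynamics" the hypothesis relies on.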

The ablation (where this model fits)

Model	Recurrence	Clifford GDR	Params
A (this model)	-	-	113.3M
B	loop x3	-	113.3M
C	-	grades	114.6M
D	loop x3	grades	114.6M

Decisive comparison: B vs D - same loop count and compute pattern, parameter counts within ~1.3M (113.3M vs 114.6M); the only architectural difference is grade decomposition. Secondary: C vs D (integrated vs bolted-on Clifford). A is the floor. Read with the gates in docs/ABLATION.md.

Architecture (shared by all four)

Component	Value
Hidden size	768
Layers	12 (4 perception / 4 reasoning / 4 expression)
Attention	Grouped-Query Attention (12 query heads, 4 KV heads)
MLP	SwiGLU (768 -> 2048 -> 768)
Positional	RoPE (theta=10000)
Norm	RMSNorm (pre-norm)
Embeddings	weight-tied input/output
Vocabulary	49,152 (SmolLM2 tokenizer)
Sequence length	1024
Precision	bf16

Models C and D add the GDR grade-update MLPs + geometric product (+~1.3M params -> 114.6M). Models B and D loop the reasoning core 3x per forward.
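As a sanity check, the 113.3M figure for the models without GDR can be reconstructed from the architecture table. The sketch below assumes bias-free linear layers, a head dimension of 768 / 12 = 64, two RMSNorm weight vectors per layer plus one final norm, and tied embeddings counted once. The card does not confirm any of these details.

```python
# Rough parameter count for the dense baseline (Model A).
# Assumptions (not stated in the card): no biases, head_dim = hidden / n_q_heads,
# two RMSNorms per layer plus one final RMSNorm, tied embeddings counted once.
hidden, layers, vocab = 768, 12, 49_152
n_q_heads, n_kv_heads, mlp_inner = 12, 4, 2048

head_dim = hidden // n_q_heads      # 64
kv_dim = n_kv_heads * head_dim      # 256: GQA shrinks the K and V projections

attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q and O, plus K and V
mlp = 3 * hidden * mlp_inner                       # SwiGLU: gate, up, down
norms = 2 * hidden                                 # pre-attention and pre-MLP RMSNorm

per_layer = attn + mlp + norms
embeddings = vocab * hidden                        # shared by input and output
total = embeddings + layers * per_layer + hidden   # + final RMSNorm

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,} (~{total / 1e6:.1f}M)")
```

Under these assumptions the count comes to 113,265,408, which rounds to the 113.3M listed for A and B. About a third of that is the embedding matrix.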

Training (shared)


Setting	Value
Data	FineWeb-Edu `sample-10BT`
Tokens	500M (~7,629 steps)
Optimizer	AdamW (lr 3e-4, wd 0.1, cosine decay, 400-step warmup)
Effective batch	65,536 tokens/step (batch 16 x grad-accum 4 x seq 1024)
Hardware	Google Colab A100-40GB (bf16 + FlashAttention + `torch.compile`)
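The step count and learning-rate curve follow from the numbers in the table. The sketch below works through the arithmetic and shows one common form of warmup plus cosine decay. The linear warmup shape and the `min_lr` floor of zero are assumptions, since the card gives only the peak rate, the warmup length, and the word "cosine".

```python
import math

# Values taken from the training table.
batch, grad_accum, seq_len = 16, 4, 1024
token_budget = 500_000_000
peak_lr, warmup_steps = 3e-4, 400

tokens_per_step = batch * grad_accum * seq_len             # effective batch in tokens
total_steps = math.ceil(token_budget / tokens_per_step)    # optimizer steps for the budget


def lr_at(step, min_lr=0.0):
    """Linear warmup, then cosine decay to min_lr (min_lr is a hypothetical floor)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"tokens per step: {tokens_per_step:,}")
print(f"total steps:     {total_steps:,}")
for step in (0, 200, 400, 4000, total_steps):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

The budget works out to 7,629.4 steps, so rounding up gives 7,630, consistent with the ~7,629 quoted above and with the `step-00007630.pt` checkpoint name.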

Load and run (free, CPU)

git clone -b experimental https://github.com/ShmidtS/HAGI.git && cd HAGI
pip install -r requirements.txt
python scripts/generate.py --hf-repo NAME0x0/hagi-ablation-a \
    --prompt "The sun is a star that" --device cpu

Inference fits in <1GB - no GPU needed. Or load the checkpoint directly:

import torch, torch.nn.functional as F
from huggingface_hub import hf_hub_download
from prototype.training.loop import load_checkpoint
from prototype.data.tokenizer import load_tokenizer

ckpt = hf_hub_download("NAME0x0/hagi-ablation-a", "step-00007630.pt")
model, step = load_checkpoint(ckpt, device="cpu"); model.eval()
tok = load_tokenizer("HuggingFaceTB/SmolLM2-135M")

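# Encode the prompt as a batch containing one sequence.
x = torch.tensor([tok.encode("The sun is a star that")])
with torch.no_grad():
    for _ in range(60):  # sample 60 new tokens
        logits = model(x)[0, -1]  # next-token logits at the last position
        probs = F.softmax(logits / 0.8, dim=-1)  # temperature 0.8
        x = torch.cat([x, torch.multinomial(probs, 1).view(1, 1)], dim=1)
print(tok.decode(x[0].tolist()))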

Model family (click to traverse)

All models below share data, tokenizer, schedule, and token budget. Stage 0 is the separate pretraining-baseline track; A/B/C/D are this controlled ablation.

Model	Role
Stage 0	pretraining baseline (separate track, not part of this ablation)
A - you are here	dense baseline - the experimental control / floor
B	recurrence alone (looped reasoning core, flat hidden state)
C	Clifford grade decomposition WITHOUT recurrence
D	loop + grade decomposition (the full mechanism under test)

Results

All four scored on the same 819,200-token fixed batch set (identical batches; lower loss is better). One run per model, training seed 42.

Model	Params	Loss	Perplexity
A (this)	113.3M	3.4880	32.72
B	113.3M	3.4852	32.63
C	114.6M	3.4771	32.37
D	114.6M	3.4702	32.14

B vs D: loss D-B = -0.0150 (negative = grade decomposition helps).
C vs D: loss D-C = -0.0069 (negative = integrated GDR beats bolted-on Clifford).
On these numbers loss improves in the order A, B, C, D (D lowest), which matches the hypothesis: the grade decomposition carries the signal, and recurrence helps mainly in its presence (loop-alone B barely moves A).
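The perplexities and deltas above can be re-derived from the loss column, since perplexity is the exponential of the mean per-token loss. A minimal check, using only the values in the table:

```python
import math

# Loss values copied from the Results table.
loss = {"A": 3.4880, "B": 3.4852, "C": 3.4771, "D": 3.4702}

# Perplexity is exp(mean cross-entropy loss in nats).
perplexity = {name: math.exp(value) for name, value in loss.items()}

delta_db = loss["D"] - loss["B"]   # negative means D has lower loss than B
delta_dc = loss["D"] - loss["C"]   # negative means D has lower loss than C

for name in loss:
    print(f"{name}: loss {loss[name]:.4f}  perplexity {perplexity[name]:.2f}")
print(f"D-B = {delta_db:+.4f}")
print(f"D-C = {delta_dc:+.4f}")
```

The gap between the best and worst model is under 0.02 nats, which is why a single run per model is not enough to separate them.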

These are train-set numbers (one run per model). They did NOT hold up: on held-out validation across 5 seeds (note below), the ranking reverses - B beats D every time (mean D-B = +0.0175). The small train-set D-advantage does not generalize, consistent with mild overfitting by D's extra parameters. Conclusion: grade decomposition does not improve held-out loss at this scale. A negative result - see docs/ABLATION.md for the gates.

This model's sample (prompt "The sun is a star that", temperature 0.8, seed 42):

The sun is a star that rises in the sky, then rises in the sky, eventually the sun. The two stars, the sun and the moon, are equally the same but the earth is smaller in size and has more diameter

Seed stability: across 5 seeds on held-out validation (shard_00006.bin), D beat B in 0/5 runs (mean loss D-B = +0.0175).
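A simple sign test shows how much a 0/5 split can and cannot say. The sketch below is not part of the original analysis. It assumes the five seeds are independent and that, if B and D were truly equivalent, each seed would favor either model with probability 0.5.

```python
from math import comb

n_seeds = 5   # held-out runs reported
d_wins = 0    # runs in which D beat B


def prob_at_most(k, n, p=0.5):
    """Binomial probability of k or fewer successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


one_sided = prob_at_most(d_wins, n_seeds)   # chance of D winning 0 of 5 under the null
two_sided = min(1.0, 2 * one_sided)         # either model sweeping all five seeds

print(f"one-sided p = {one_sided:.5f}")
print(f"two-sided p = {two_sided:.5f}")
```

Five seeds is the smallest sample at which a clean sweep reaches the one-sided 0.05 level (p = 1/32), so the direction is consistent but the evidence is still thin.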

Geometry diagnostic (follow-up)

To locate why full GDR (D) lost to recurrence-only (B), a third arm - D_nogeo, GDR with the geometric-product cross-grade term switched off (same 114.6M parameters as D) - was trained head-to-head with B and D and evaluated on the same held-out shard (Kaggle T4, fp16, a single seed: seed 1).

Comparison	Mean held-out loss delta (lower = better)
D - B (full GDR vs recurrence-only)	+0.0210
D_nogeo - B (grades, no geometric product)	+0.0176

Removing the geometric product does not recover the gap - the grade machinery itself hurts at this scale, independent of the geometric product. GDR-as-built is falsified here; the path forward is a paper-faithful rebuild or a pivot to capability.
