AGILLM-3-Large v2

Continuation of AGILLM-3-Large training, restarted from a known-good checkpoint after a critical tokenizer bug was discovered.

Demo Space: OpenTransformer/AGILLM-3-large-v2-demo

What happened to v1?

A transformers library update (to v5.3.0) on 2026-03-11 silently broke the DeepSeek-V3.2 tokenizer's encode/decode pipeline:

  • Root cause: The tokenizer's Metaspace pre-tokenizer was configured to use ▁ (U+2581, the SentencePiece convention) for space replacement, but the BPE vocabulary uses Ġ (U+0120, the GPT-2 convention). This mismatch caused:

    • Encoding: All spaces were silently dropped. "Water boils" encoded to ['Water', 'bo', 'ils'] instead of ['Water', 'Ġbo', 'ils']
    • Decoding: tok.decode() lost all spaces. Round-trip encode→decode of "The meaning of life" returned "Themeaningoflife"
    • Training data corruption: ~3 billion tokens of training data were fed to the model without any space information, causing the model weights to degrade
  • Detection: Model output degraded from coherent English (step 12,373,125) to space-less garbled text (step 12,528,061 onward); the regression was only traced back to the tokenizer library update after investigation.

  • Fix: Pinned transformers==4.48.0 (+ tokenizers==0.21.4), which correctly handles the Ġ space prefix. Also added a runtime fix in n.py that patches the ▁→Ġ mismatch if detected.
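The failure mode above is easy to reproduce with a round-trip health check. The sketch below is illustrative: ToyTokenizer is a stand-in for any object with encode/decode (such as a Hugging Face tokenizer), and only the two space-marker code points are taken from the bug description above.

```python
GPT2_SPACE = "\u0120"  # 'Ġ' — GPT-2 byte-level space prefix
SP_SPACE = "\u2581"    # '▁' — SentencePiece Metaspace space prefix

class ToyTokenizer:
    """Toy word-level tokenizer that marks spaces with a prefix character."""
    def __init__(self, space_marker):
        self.space_marker = space_marker

    def encode(self, text):
        # Pre-tokenization: glue the marker onto every non-initial word.
        words = text.split(" ")
        return [words[0]] + [self.space_marker + w for w in words[1:]]

    def decode(self, tokens):
        # The decoder only understands the GPT-2 convention: Ġ becomes a
        # space, while a stray ▁ is cleaned away — so the space is lost,
        # the same failure mode as the v1 pipeline.
        text = "".join(tokens).replace(GPT2_SPACE, " ")
        return text.replace(SP_SPACE, "")

def spaces_survive_roundtrip(tok, probe="The meaning of life"):
    """The health check: encode→decode must reproduce the input exactly."""
    return tok.decode(tok.encode(probe)) == probe

print(spaces_survive_roundtrip(ToyTokenizer(GPT2_SPACE)))  # True
print(spaces_survive_roundtrip(ToyTokenizer(SP_SPACE)))    # False
```

With the mismatched ▁ marker the round trip returns "Themeaningoflife", exactly the corruption described above; running a check like this before every training run would catch a silent tokenizer regression immediately.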

This repo

Resumes training from step 12,373,125 (~10.83B tokens, 30.9%) — the last checkpoint with correctly-encoded training data.

Model

Parameter                  Value
Parameters                 698M
Hidden dim                 1024
Layers                     24
Heads                      16
Rank                       128
Expansion ratio            2.0x
Vocab                      128,815 (DeepSeek-V3.2 tokenizer)
Architecture               Joint AR + SAT (autoregressive + span-aware transformer)
Training target            35B tokens
Tokens seen (at restart)   ~10.83B (30.9%)

Training setup

  • GPU: RTX 4090 (Vast.ai, ~$0.27/hr)
  • Speed: ~20,000 tok/s
  • Block size: 1122
  • Batch size: 1
  • Mixed precision: AMP (BF16)
  • Optimizer: AdamW (LR core=5e-5, LR head=2e-4)
  • Data: Streamed from multiple HuggingFace datasets (web crawl, cleaned text)
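At the listed throughput and hourly rate, the time and cost of the remaining pretraining run can be estimated with back-of-envelope arithmetic. The figures above are approximate (~ values), so the result is rough:

```python
# Rough cost/time estimate for the remaining run, from the figures above.
target_tokens = 35_000_000_000
seen_tokens = 10_830_000_000   # ~10.83B at restart
tok_per_s = 20_000             # observed throughput
usd_per_hr = 0.27              # Vast.ai RTX 4090 rate

remaining = target_tokens - seen_tokens
hours = remaining / tok_per_s / 3600
cost = hours * usd_per_hr
print(f"~{hours:.0f} h, ~${cost:.0f}")  # ~336 h, ~$91
```

That is, roughly two weeks of wall-clock time and under a hundred dollars at constant speed, which is why a single consumer GPU is viable here.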

Important: Tokenizer compatibility

This model requires transformers==4.48.0 for correct tokenizer behavior:

pip install transformers==4.48.0 tokenizers==0.21.4

Or use the runtime fix in n.py, which auto-patches the ▁/Ġ mismatch if it is detected.
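A lightweight guard against a silent dependency upgrade is a version check at import time. The helper below is a hypothetical sketch (pure Python, numeric dotted versions only), not something this repo necessarily ships:

```python
def version_at_most(version: str, pin: str = "4.48.0") -> bool:
    """Compare dotted numeric versions, e.g. '4.48.0' vs '5.3.0'.

    Assumes plain numeric components (no '.dev0' suffixes).
    """
    return tuple(map(int, version.split("."))) <= tuple(map(int, pin.split(".")))

print(version_at_most("4.48.0"))  # True  — the pinned, verified-good version
print(version_at_most("5.3.0"))   # False — the release that broke v1
```

In practice one would call this on transformers.__version__ at startup and raise an error, or apply the ▁/Ġ patch, when it returns False.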

Sample output (step 12,373,125)

Prompt: "Water boils at one hundred degrees"

Water boils at one hundred degrees in the 1990s, and a year after that. "It's not just that, but it makes you think." He said: "I don't think I'm going to make a deal for the rest of my life."

Latest smoke-test output (step 30,028,112)

Generated from pretrain_delta_step30028112.pt on 2026-05-16 after the tokenizer health check passed with transformers==4.48.0 and tokenizers==0.21.4.

  • Tokens seen at checkpoint: between 33,185,096,055 and 33,186,096,759 / 35,000,000,000 tokens (~94.8%).
  • Live ETA snapshot: as of 2026-05-16 04:00:55 UTC, training was at 33,217,118,583 / 35,000,000,000 tokens (94.906053%) at 5,221 tok/s, with 1,782,881,417 tokens remaining. Pretraining was projected to finish at 2026-05-20 02:52:17 UTC (03:52:17 UK) if speed holds.
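The ETA quoted above can be reproduced with a few lines of datetime arithmetic; every constant below is one of the snapshot numbers from the bullet list:

```python
from datetime import datetime, timedelta, timezone

# Reproduce the live ETA snapshot from its own numbers.
target = 35_000_000_000
seen = 33_217_118_583
rate = 5_221  # tok/s

remaining = target - seen                 # tokens left to train on
eta_s = remaining // rate                 # whole seconds at constant speed
snapshot = datetime(2026, 5, 16, 4, 0, 55, tzinfo=timezone.utc)
finish = snapshot + timedelta(seconds=eta_s)

print(remaining)               # 1782881417
print(f"{seen / target:.6%}")  # 94.906053%
print(finish.isoformat())      # 2026-05-20T02:52:17+00:00
```

The projected finish matches the stated 2026-05-20 02:52:17 UTC exactly, so the snapshot's figures are internally consistent.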

Prompt: "Water boils at one hundred degrees"

Water boils at one hundred degrees, and there is a challenge to the Doctor to receive the two-week term, before finally waiting for their upcoming term. The case comes as the justices are poised against it for an exemption because there is not what a Mississippi law that bars most

License

Apache 2.0
