# πŸ”‘ domainTokenizer

*Building small models that understand domain tokens β€” not just words.*


## The Idea

LLMs process text by breaking it into tokens (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of non-textual sequential data every day β€” purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

domainTokenizer explores a different approach: what if we built tokenizers that encode domain-specific entities β€” products, transactions, medical codes, user actions β€” as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

```
Text LLM:        "The cat sat on the mat" β†’ [The] [cat] [sat] [on] [the] [mat] β†’ Transformer β†’ next word

domainTokenizer: Customer purchase history β†’ [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] β†’ Transformer β†’ next purchase
```
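
Mechanically, making an entity a first-class token just means registering it as an atomic vocabulary entry the tokenizer will never split. A minimal illustration with the standard `transformers` API (the token strings are made up, and GPT-2 stands in for any BPE tokenizer):

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer stands in for any BPE tokenizer; token names are illustrative.
tok = AutoTokenizer.from_pretrained("gpt2")

# Register domain tokens as atomic vocabulary entries that BPE will never split.
tok.add_tokens(["[HighEndElectronics]", "[WeekdayCredit]", "[price_bin_37]"])

print(tok.tokenize("[HighEndElectronics] [price_bin_37] usb-c charging cable"))
# The domain tokens survive intact; only the free-text tail is split into BPE pieces.
```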

## Quick Start

```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# `all_events`, `descriptions`, and `user_sequences` below are your raw domain data.

# 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)                                          # fit magnitude bins on training data
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data (100% token utilization, zero padding waste)
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create model (GPT-style, NoPE, pre-norm β€” 24M params)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)

# 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",        # auto-push to HF Hub
    num_epochs=10, learning_rate=3e-4,
    bf16=True,                             # A100/H100
    report_to="trackio",                   # live monitoring
)

# 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
from domain_tokenizer import JointFusionModel
fusion = JointFusionModel(
    transformer_model=model,               # pre-trained, unfrozen
    n_tabular_features=291,                # hand-crafted tabular features
    n_classes=1,                           # binary: will the user activate the product?
)
# Train fusion model end-to-end on labeled data...
```
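
Once pre-trained, next-event prediction is an ordinary causal-LM forward pass. A hedged sketch, assuming `DomainTransformerForCausalLM` follows the standard HF causal-LM interface and that the built tokenizer accepts a serialized event string (the token strings are illustrative):

```python
import torch

# Sketch only: assumes the standard HF causal-LM interface (input_ids β†’ logits)
# and illustrative token strings; not a documented domain_tokenizer recipe.
history = "[HighEndElectronics] [WeekdayCredit] [price_bin_37]"
inputs = hf_tokenizer(history, return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits        # (1, seq_len, vocab_size)

next_id = int(logits[0, -1].argmax())
print(hf_tokenizer.decode([next_id]))      # most likely next domain token
```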

## 🏦 Industry Validation: Nubank's nuFormer

This isn't just theory. Nubank (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:

Their approach: 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions β†’ ~14 tokens per transaction β†’ GPT-style Transformer (24M-330M params) β†’ +1.25% relative AUC over LightGBM (3Γ— their production launch threshold).
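
To make "~14 tokens per transaction" concrete, here is a hedged sketch of how one transaction could decompose under that recipe; every token name below is illustrative, not Nubank's actual vocabulary:

```python
# One transaction β‰ˆ 14 tokens: structured fields become special tokens,
# the merchant description is BPE-encoded. All names are illustrative.
transaction = [
    "[TXN]",               # event delimiter
    "[DEBIT]",             # credit/debit sign
    "[AMOUNT_BIN_22]",     # fitted magnitude bucket
    "[WEDNESDAY]", "[AFTERNOON]", "[MARCH]",   # calendar features
    # merchant description "PADARIA SILVA 0042" as BPE pieces:
    "pad", "aria", "sil", "va", "00", "42",
]
```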

πŸ“„ Full reverse-engineering: `docs/nubank_nuformer_analysis.md`

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---|---|---|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in a flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` β†’ composite token |
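
The `price_bin_37` row hides a step worth spelling out: bin edges are fitted on training data as quantiles, so the token encodes where an amount sits in the observed distribution. A minimal numpy sketch of the idea (50 bins assumed; this is not the library's MagnitudeBucket implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
train_prices = rng.lognormal(mean=3.0, sigma=1.2, size=10_000)  # stand-in training amounts

# Fit 50 quantile bin edges on training data only (the builder.fit step above).
edges = np.quantile(train_prices, np.linspace(0.0, 1.0, 51))

def price_token(price: float) -> str:
    # Map a price to its quantile bucket; clip keeps unseen outliers in range.
    b = int(np.clip(np.searchsorted(edges, price, side="right") - 1, 0, 49))
    return f"price_bin_{b}"

print(price_token(79.99))  # bin 37 of 50 would mean roughly the 74th percentile
```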

## Documentation

| Document | Description |
|---|---|
| πŸ“„ `docs/research_report.md` | Research survey β€” 31 papers across 5 paradigms, technical taxonomy, blueprint |
| 🏦 `docs/nubank_nuformer_analysis.md` | Nubank reverse-engineering β€” full pipeline reconstruction, 4 academic pillars |
| πŸ—οΈ `docs/adr/ADR-001-implementation-framework.md` | Architecture Decision Record β€” PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
| πŸ“Š `docs/phase2_implementation_report.md` | Implementation report β€” Phase 2A-2C technical decisions, architecture, 124 tests |

## Project Roadmap

### Phase 1: Research & Survey βœ…

- Literature survey (35+ papers), Nubank reverse-engineering, framework ADR

### Phase 2: Core Library βœ… (v0.3.0 β€” 124 tests passing)

- 2A: Domain tokenizer library β€” schema, 5 field tokenizers, HF-compatible builder
- 2B: Model architecture β€” DomainTransformerForCausalLM (NoPE GPT), PLR embeddings, DCNv2 + JointFusion
- 2C: Pre-training pipeline β€” sequence packing (sketched below), DataCollatorForLanguageModeling, HF Trainer
- 2D: Fine-tuning pipeline (next)
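
Since 2C's sequence packing carries the "100% token utilization" claim in the Quick Start, here is a minimal sketch of the technique itself, independent of the library's `prepare_clm_dataset`:

```python
from itertools import chain

def pack(sequences: list[list[int]], block_size: int, eos_id: int) -> list[list[int]]:
    """Concatenate EOS-separated token sequences and slice into fixed blocks.

    No padding tokens are inserted, so every position contributes to the
    CLM loss. Sketch only, not the library's prepare_clm_dataset."""
    stream = list(chain.from_iterable(seq + [eos_id] for seq in sequences))
    usable = (len(stream) // block_size) * block_size   # drop the ragged tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

print(pack([[5, 6, 7], [8, 9], [10, 11, 12, 13]], block_size=4, eos_id=0))
# [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```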

### Phase 3: Domain Demos

- Finance: fraud detection, credit scoring on real data
- E-commerce: next-purchase prediction, customer segmentation

### Phase 4: Scale & Optimize

- 330M-param models, torch.compile(), ONNX export, ActionPiece vocabulary

## Repo Structure

```
src/domain_tokenizer/
β”œβ”€β”€ __init__.py                     # v0.3.0 β€” all public exports
β”œβ”€β”€ schema.py                       # DomainSchema, FieldSpec, FieldType
β”œβ”€β”€ tokenizers/
β”‚   β”œβ”€β”€ field_tokenizers.py         # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
β”‚   └── domain_tokenizer.py         # DomainTokenizerBuilder β†’ HF PreTrainedTokenizerFast
β”œβ”€β”€ schemas/
β”‚   └── predefined.py               # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ configuration.py            # DomainTransformerConfig (24M/85M/330M presets)
β”‚   β”œβ”€β”€ modeling.py                 # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
β”‚   β”œβ”€β”€ plr_embeddings.py           # PeriodicLinearReLU (Gorishniy et al. 2022)
β”‚   └── joint_fusion.py             # DCNv2 + JointFusionModel (nuFormer-style)
└── training/
    β”œβ”€β”€ data_pipeline.py            # tokenize β†’ pack β†’ HFDataset
    └── pretrain.py                 # pretrain_domain_model (HF Trainer)
tests/
β”œβ”€β”€ test_tokenizer.py               # 72 tests
β”œβ”€β”€ test_model.py                   # 33 tests
└── test_training.py                # 19 tests
```
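
The three predefined schemas cover common cases, but `schema.py` is the extension point. A hedged sketch of a custom schema: the class names come from the layout above, while the exact `FieldSpec` signature and `FieldType` values are assumptions, so check the source before copying:

```python
from domain_tokenizer import DomainTokenizerBuilder
from domain_tokenizer.schema import DomainSchema, FieldSpec, FieldType

# Assumed constructor shapes; only the DomainSchema/FieldSpec/FieldType names
# are confirmed by the repo layout above. Verify against schema.py.
LOGISTICS_SCHEMA = DomainSchema(
    name="logistics",
    fields=[
        FieldSpec("carrier", FieldType.CATEGORICAL),
        FieldSpec("weight_kg", FieldType.MAGNITUDE_BUCKET),
        FieldSpec("shipped_at", FieldType.CALENDAR),
    ],
)

builder = DomainTokenizerBuilder(LOGISTICS_SCHEMA)
```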

## Key References

| Paper | Year | Role in domainTokenizer | Link |
|---|---|---|---|
| nuFormer (Nubank) | 2025 | Overall architecture blueprint | arXiv |
| NoPE | 2023 | No positional encoding β€” our attention design | arXiv |
| PLR Embeddings (Yandex) | 2022 | Numerical feature embeddings | arXiv |
| DCN V2 (Google) | 2021 | Tabular feature crossing in joint fusion | arXiv |
| RecFormer | 2023 | Items-as-text tokenization philosophy | arXiv |
| TIGER (Google) | 2023 | Semantic IDs via RQ-VAE | arXiv |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | arXiv |
| Banking TF | 2024 | Transaction tokenizer for French banking | arXiv |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | arXiv |
Full reference table (35+ papers): `docs/research_report.md`
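
Of these, PLR embeddings do the quiet work of feeding raw numbers into the model: Gorishniy et al. (2022) embed a scalar through learned periodic features, a linear layer, and a ReLU. A minimal PyTorch sketch of that recipe (not the library's `plr_embeddings.py`):

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic β†’ Linear β†’ ReLU embedding for one scalar feature
    (Gorishniy et al. 2022). Sketch only, not the library module."""

    def __init__(self, n_frequencies: int = 48, d_embedding: int = 64, sigma: float = 1.0):
        super().__init__()
        # Frequencies initialized from N(0, sigma^2), trained with the model.
        self.c = nn.Parameter(sigma * torch.randn(n_frequencies))
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch,)
        v = 2 * torch.pi * self.c * x[:, None]            # (batch, n_frequencies)
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)
        return torch.relu(self.linear(periodic))          # (batch, d_embedding)

emb = PLREmbedding()
print(emb(torch.tensor([79.99, 450.0])).shape)            # torch.Size([2, 64])
```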

## License

MIT
