# domainTokenizer
Building small models that understand domain tokens, not just words.
## The Idea
LLMs process text by breaking it into tokens (subword units like ing, the, un-). This tokenization is what enables Transformers to model sequential patterns.
But words are just one type of sequential data. Businesses generate massive amounts of non-textual sequential data every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
domainTokenizer explores a different approach: what if we built tokenizers that encode domain-specific entities (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
```
Text LLM:        "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word
domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```
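To make the analogy concrete, here is what such a mapping can look like in plain Python. This is an illustration only and does not use the library below; the function name, token spellings, and the $25-wide amount buckets are invented for the example.

```python
# Illustrative only: hand-rolled mapping from raw purchase events to domain tokens.
def event_to_tokens(event: dict) -> list[str]:
    return [
        f"[CATEGORY_{event['category']}]",                     # e.g. [CATEGORY_Electronics]
        f"[AMOUNT_BIN_{min(int(event['amount'] // 25), 9)}]",  # coarse price bucket 0-9
        f"[{event['weekday']}]",                               # e.g. [Wednesday]
        f"[{event['channel']}]",                               # e.g. [Credit]
    ]

history = [
    {"category": "Electronics", "amount": 199.90, "weekday": "Wednesday", "channel": "Credit"},
    {"category": "Accessory", "amount": 19.90, "weekday": "Wednesday", "channel": "Credit"},
]
sequence = [tok for event in history for tok in event_to_tokens(event)]
# ['[CATEGORY_Electronics]', '[AMOUNT_BIN_7]', '[Wednesday]', '[Credit]',
#  '[CATEGORY_Accessory]', '[AMOUNT_BIN_0]', '[Wednesday]', '[Credit]']
```

A Transformer trained on millions of such sequences can then predict the next purchase token the same way a text LLM predicts the next word.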
## Quick Start
```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)  # fit magnitude bins on training data
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data (100% token utilization, zero padding waste)
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create model (GPT-style, NoPE, pre-norm, 24M params)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)

# 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",  # auto-push to HF Hub
    num_epochs=10, learning_rate=3e-4,
    bf16=True,                       # A100/H100
    report_to="trackio",             # live monitoring
)

# 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
from domain_tokenizer import JointFusionModel

fusion = JointFusionModel(
    transformer_model=model,   # pre-trained, unfrozen
    n_tabular_features=291,    # hand-crafted tabular features
    n_classes=1,               # binary: will the user activate the product?
)
# Train the fusion model end-to-end on labeled data...
```
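The packing in step 2 is what the "zero padding waste" comment refers to: every user's token stream is concatenated and cut into fixed-size blocks instead of padding each sequence to `block_size`. Below is a minimal, library-independent sketch of that idea; the helper name and the EOS separator are assumptions for illustration.

```python
def pack_sequences(sequences: list[list[int]], block_size: int, eos_id: int) -> list[list[int]]:
    """Concatenate token-id sequences (separated by EOS) and slice into full blocks."""
    stream: list[int] = []
    for seq in sequences:
        stream.extend(seq + [eos_id])
    n_blocks = len(stream) // block_size  # the trailing remainder is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

blocks = pack_sequences([[5, 8, 13], [21, 34]], block_size=4, eos_id=0)
# [[5, 8, 13, 0]]  -> the leftover [21, 34, 0] does not fill a block and is discarded
```

Every position in every block is a real token, so no compute is spent attending to padding.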
## Industry Validation: Nubank's nuFormer
This isn't just theory. Nubank (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
- Paper: "Your spending needs attention: Modeling financial habits with transformers" (arXiv: 2507.23267)
- Blog series: Building Nubank - Foundation Models
Their approach: 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → ~14 tokens per transaction → GPT-style Transformer (24M-330M params) → +1.25% relative AUC over LightGBM (3× their production launch threshold).
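The hybrid vocabulary ("special domain tokens plus BPE over free-text descriptions") is easy to picture with the Hugging Face `tokenizers` library. The snippet below is a generic sketch of that recipe, not this repo's builder; the token names and toy corpus are invented.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A handful of invented domain special tokens (the Nubank recipe uses ~97 of them).
domain_tokens = ["[PAD]", "[UNK]", "[EOS]", "[CREDIT]", "[DEBIT]",
                 "[AMT_BIN_0]", "[AMT_BIN_1]", "[MONDAY]", "[TUESDAY]"]

tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=domain_tokens)

descriptions = ["UBER TRIP SAO PAULO", "AMAZON MARKETPLACE", "STARBUCKS 1234"]  # toy corpus
tok.train_from_iterator(descriptions, trainer)

# Structured fields arrive as whole special tokens; merchant text is split by BPE.
print(tok.encode("[CREDIT] [AMT_BIN_1] UBER TRIP SAO PAULO").tokens)
```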
Full reverse-engineering: `docs/nubank_nuformer_analysis.md`
## Why This Matters
| Problem | Text Tokenizer | Domain Tokenizer |
|---|---|---|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as a Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in a flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
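The price and timestamp rows correspond to standard tricks: fit quantile bins on training data, then emit bin and calendar tokens at encoding time. Here is a library-independent sketch; the bin count, token spellings, and time-of-day buckets are assumptions, not the project's actual vocabulary.

```python
import numpy as np
from datetime import datetime

# Fit 100 percentile-based price bins on training data (synthetic prices here).
train_prices = np.random.default_rng(0).lognormal(mean=3.5, sigma=1.0, size=10_000)
bin_edges = np.quantile(train_prices, np.linspace(0, 1, 101))

def price_token(price: float) -> str:
    idx = int(np.searchsorted(bin_edges, price, side="right")) - 1
    return f"price_bin_{min(max(idx, 0), 99)}"

def calendar_tokens(ts: datetime, prev_ts: datetime | None = None) -> list[str]:
    tokens = [ts.strftime("%A"), "Afternoon" if 12 <= ts.hour < 18 else "OtherTime"]
    if prev_ts is not None:
        tokens.append(f"{(ts - prev_ts).days}_days_later")
    return tokens

print(price_token(79.99))  # maps the price to its training-set percentile bin
print(calendar_tokens(datetime(2025, 3, 12, 14, 0), datetime(2025, 3, 10)))
# ['Wednesday', 'Afternoon', '2_days_later']
```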
## Documentation
| Document | Description |
|---|---|
| `docs/research_report.md` | Research survey: 31 papers across 5 paradigms, technical taxonomy, blueprint |
| `docs/nubank_nuformer_analysis.md` | Nubank reverse-engineering: full pipeline reconstruction, 4 academic pillars |
| `docs/adr/ADR-001-implementation-framework.md` | Architecture Decision Record: PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
| `docs/phase2_implementation_report.md` | Implementation report: Phase 2A-2C technical decisions, architecture, 124 tests |
## Project Roadmap
**Phase 1: Research & Survey** ✅
- Literature survey (35+ papers), Nubank reverse-engineering, framework ADR

**Phase 2: Core Library** ✅ (v0.3.0, 124 tests passing)
- **2A** Domain tokenizer library: schema, 5 field tokenizers, HF-compatible builder
- **2B** Model architecture: DomainTransformerForCausalLM (NoPE GPT), PLR embeddings (sketched below), DCNv2 + JointFusion
- **2C** Pre-training pipeline: sequence packing, DataCollatorForLanguageModeling, HF Trainer
- **2D** Fine-tuning pipeline (next)

**Phase 3: Domain Demos**
- Finance: fraud detection, credit scoring on real data
- E-commerce: next-purchase prediction, customer segmentation

**Phase 4: Scale & Optimize**
- 330M-param models, torch.compile(), ONNX export, ActionPiece vocabulary
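Phase 2B's PLR ("periodic, linear, ReLU") embeddings follow Gorishniy et al. (2022): a scalar feature is projected onto random sinusoidal frequencies, then passed through a linear layer and a ReLU. The sketch below illustrates the idea and is not the repository's `plr_embeddings.py` module; the dimensions and initialization are assumptions.

```python
import torch
import torch.nn as nn

class PeriodicLinearReLU(nn.Module):
    """PLR embedding for one scalar feature: periodic encoding -> Linear -> ReLU."""
    def __init__(self, n_frequencies: int = 48, d_embedding: int = 64, sigma: float = 0.1):
        super().__init__()
        # Trainable frequencies, initialized from N(0, sigma^2).
        self.frequencies = nn.Parameter(sigma * torch.randn(n_frequencies))
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) raw feature values, e.g. transaction amounts
        v = 2 * torch.pi * self.frequencies * x.unsqueeze(-1)       # (batch, n_frequencies)
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)  # (batch, 2 * n_frequencies)
        return torch.relu(self.linear(periodic))                    # (batch, d_embedding)

emb = PeriodicLinearReLU()
amounts = torch.tensor([12.5, 79.99, 1500.0])
print(emb(amounts).shape)  # torch.Size([3, 64])
```

This gives numerical fields a dense embedding that can be mixed with token embeddings, rather than forcing every number through a discrete bin.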
## Repo Structure
```
src/domain_tokenizer/
├── __init__.py                 # v0.3.0, all public exports
├── schema.py                   # DomainSchema, FieldSpec, FieldType
├── tokenizers/
│   ├── field_tokenizers.py     # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
│   └── domain_tokenizer.py     # DomainTokenizerBuilder → HF PreTrainedTokenizerFast
├── schemas/
│   └── predefined.py           # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
├── models/
│   ├── configuration.py        # DomainTransformerConfig (24M/85M/330M presets)
│   ├── modeling.py             # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
│   ├── plr_embeddings.py       # PeriodicLinearReLU (Gorishniy et al. 2022)
│   └── joint_fusion.py         # DCNv2 + JointFusionModel (nuFormer-style)
└── training/
    ├── data_pipeline.py        # tokenize → pack → HF Dataset
    └── pretrain.py             # pretrain_domain_model (HF Trainer)

tests/
├── test_tokenizer.py           # 72 tests
├── test_model.py               # 33 tests
└── test_training.py            # 19 tests
```
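`joint_fusion.py` pairs the pre-trained transformer with DCN-V2 cross layers over hand-crafted tabular features, in the spirit of nuFormer. The following is a generic sketch of one DCN-V2 cross layer and a toy fusion forward pass, not the repository's actual module; the pooling choice and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCN-V2 cross layer: x_{l+1} = x_0 * (W x_l + b) + x_l (Wang et al., 2021)."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.linear(xl) + xl

# Toy fusion: concatenate a pooled transformer embedding with tabular features,
# then stack cross layers to model explicit feature interactions before the head.
seq_emb = torch.randn(8, 256)    # assumed pooled output of the domain transformer
tab_feats = torch.randn(8, 291)  # the 291 hand-crafted features from the Quick Start
x0 = torch.cat([seq_emb, tab_feats], dim=-1)

cross_layers = nn.ModuleList(CrossLayerV2(x0.shape[-1]) for _ in range(3))
x = x0
for layer in cross_layers:
    x = layer(x0, x)
logits = nn.Linear(x0.shape[-1], 1)(x)  # binary head (n_classes=1), shape (8, 1)
```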
## Key References
| Paper | Year | Role in domainTokenizer | Link |
|---|---|---|---|
| nuFormer (Nubank) | 2025 | Overall architecture blueprint | arXiv |
| NoPE | 2023 | No positional encoding, the basis of our attention design | arXiv |
| PLR Embeddings (Yandex) | 2022 | Numerical feature embeddings | arXiv |
| DCN V2 (Google) | 2021 | Tabular feature crossing in joint fusion | arXiv |
| RecFormer | 2023 | Items-as-text tokenization philosophy | arXiv |
| TIGER (Google) | 2023 | Semantic IDs via RQ-VAE | arXiv |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | arXiv |
| Banking TF | 2024 | Transaction tokenizer for French banking | arXiv |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | arXiv |
Full reference table (35+ papers): docs/research_report.md
## License
MIT