# πŸ”‘ domainTokenizer

*Building small models that understand domain tokens β€” not just words.*


## The Idea

LLMs process text by breaking it into tokens (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of non-textual sequential data every day β€” purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

domainTokenizer explores a different approach: what if we built tokenizers that encode domain-specific entities β€” products, transactions, medical codes, user actions β€” as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

```
Text LLM:        "The cat sat on the mat" β†’ [The] [cat] [sat] [on] [the] [mat] β†’ Transformer β†’ next word

domainTokenizer: Customer purchase history β†’ [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] β†’ Transformer β†’ next purchase
```
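
Mechanically, making an entity a first-class token just means registering it as an atomic vocabulary entry the tokenizer will never split. A minimal illustration with the standard `transformers` API (the token strings are made up, and GPT-2 stands in for any BPE tokenizer):

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer stands in for any BPE tokenizer; token names are illustrative.
tok = AutoTokenizer.from_pretrained("gpt2")

# Register domain tokens as atomic vocabulary entries that BPE will never split.
tok.add_tokens(["[HighEndElectronics]", "[WeekdayCredit]", "[price_bin_37]"])

print(tok.tokenize("[HighEndElectronics] [price_bin_37] usb-c charging cable"))
# The domain tokens survive intact; only the free-text tail is split into BPE pieces.
```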

## Quick Start

```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# `all_events`, `descriptions`, and `user_sequences` below are your raw domain data.

# 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)                                          # fit magnitude bins on training data
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data (100% token utilization, zero padding waste)
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create model (GPT-style, NoPE, pre-norm β€” 24M params)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)

# 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",        # auto-push to HF Hub
    num_epochs=10, learning_rate=3e-4,
    bf16=True,                             # A100/H100
    report_to="trackio",                   # live monitoring
)

# 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
from domain_tokenizer import JointFusionModel
fusion = JointFusionModel(
    transformer_model=model,               # pre-trained, unfrozen
    n_tabular_features=291,                # hand-crafted tabular features
    n_classes=1,                           # binary: will the user activate the product?
)
# Train fusion model end-to-end on labeled data...
```
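
Once pre-trained, next-event prediction is an ordinary causal-LM forward pass. A hedged sketch, assuming `DomainTransformerForCausalLM` follows the standard HF causal-LM interface and that the built tokenizer accepts a serialized event string (the token strings are illustrative):

```python
import torch

# Sketch only: assumes the standard HF causal-LM interface (input_ids β†’ logits)
# and illustrative token strings; not a documented domain_tokenizer recipe.
history = "[HighEndElectronics] [WeekdayCredit] [price_bin_37]"
inputs = hf_tokenizer(history, return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits        # (1, seq_len, vocab_size)

next_id = int(logits[0, -1].argmax())
print(hf_tokenizer.decode([next_id]))      # most likely next domain token
```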

## 🏦 Industry Validation: Nubank's nuFormer

This isn't just theory. Nubank (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:

Their approach: 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions β†’ ~14 tokens per transaction β†’ GPT-style Transformer (24M-330M params) β†’ +1.25% relative AUC over LightGBM (3Γ— their production launch threshold).
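
To make "~14 tokens per transaction" concrete, here is a hedged sketch of how one transaction could decompose under that recipe; every token name below is illustrative, not Nubank's actual vocabulary:

```python
# One transaction β‰ˆ 14 tokens: structured fields become special tokens,
# the merchant description is BPE-encoded. All names are illustrative.
transaction = [
    "[TXN]",               # event delimiter
    "[DEBIT]",             # credit/debit sign
    "[AMOUNT_BIN_22]",     # fitted magnitude bucket
    "[WEDNESDAY]", "[AFTERNOON]", "[MARCH]",   # calendar features
    # merchant description "PADARIA SILVA 0042" as BPE pieces:
    "pad", "aria", "sil", "va", "00", "42",
]
```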

πŸ“„ Full reverse-engineering: `docs/nubank_nuformer_analysis.md`

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---|---|---|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in a flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` β†’ composite token |
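
The `price_bin_37` row hides a step worth spelling out: bin edges are fitted on training data as quantiles, so the token encodes where an amount sits in the observed distribution. A minimal numpy sketch of the idea (50 bins assumed; this is not the library's MagnitudeBucket implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
train_prices = rng.lognormal(mean=3.0, sigma=1.2, size=10_000)  # stand-in training amounts

# Fit 50 quantile bin edges on training data only (the builder.fit step above).
edges = np.quantile(train_prices, np.linspace(0.0, 1.0, 51))

def price_token(price: float) -> str:
    # Map a price to its quantile bucket; clip keeps unseen outliers in range.
    b = int(np.clip(np.searchsorted(edges, price, side="right") - 1, 0, 49))
    return f"price_bin_{b}"

print(price_token(79.99))  # bin 37 of 50 would mean roughly the 74th percentile
```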

## Documentation

| Document | Description |
|---|---|
| πŸ“„ `docs/research_report.md` | Research survey β€” 31 papers across 5 paradigms, technical taxonomy, blueprint |
| 🏦 `docs/nubank_nuformer_analysis.md` | Nubank reverse-engineering β€” full pipeline reconstruction, 4 academic pillars |
| πŸ—οΈ `docs/adr/ADR-001-implementation-framework.md` | Architecture Decision Record β€” PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
| πŸ“Š `docs/phase2_implementation_report.md` | Implementation report β€” Phase 2A-2C technical decisions, architecture, 124 tests |

## Project Roadmap

### Phase 1: Research & Survey βœ…

- Literature survey (35+ papers), Nubank reverse-engineering, framework ADR

### Phase 2: Core Library βœ… (v0.3.0 β€” 124 tests passing)

- 2A: Domain tokenizer library β€” schema, 5 field tokenizers, HF-compatible builder
- 2B: Model architecture β€” DomainTransformerForCausalLM (NoPE GPT), PLR embeddings, DCNv2 + JointFusion
- 2C: Pre-training pipeline β€” sequence packing (sketched below), DataCollatorForLanguageModeling, HF Trainer
- 2D: Fine-tuning pipeline (next)
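
Since 2C's sequence packing carries the "100% token utilization" claim in the Quick Start, here is a minimal sketch of the technique itself, independent of the library's `prepare_clm_dataset`:

```python
from itertools import chain

def pack(sequences: list[list[int]], block_size: int, eos_id: int) -> list[list[int]]:
    """Concatenate EOS-separated token sequences and slice into fixed blocks.

    No padding tokens are inserted, so every position contributes to the
    CLM loss. Sketch only, not the library's prepare_clm_dataset."""
    stream = list(chain.from_iterable(seq + [eos_id] for seq in sequences))
    usable = (len(stream) // block_size) * block_size   # drop the ragged tail
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

print(pack([[5, 6, 7], [8, 9], [10, 11, 12, 13]], block_size=4, eos_id=0))
# [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```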

### Phase 3: Domain Demos

- Finance: fraud detection, credit scoring on real data
- E-commerce: next-purchase prediction, customer segmentation

### Phase 4: Scale & Optimize

- 330M-param models, torch.compile(), ONNX export, ActionPiece vocabulary

## Repo Structure

```
src/domain_tokenizer/
β”œβ”€β”€ __init__.py                     # v0.3.0 β€” all public exports
β”œβ”€β”€ schema.py                       # DomainSchema, FieldSpec, FieldType
β”œβ”€β”€ tokenizers/
β”‚   β”œβ”€β”€ field_tokenizers.py         # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
β”‚   └── domain_tokenizer.py         # DomainTokenizerBuilder β†’ HF PreTrainedTokenizerFast
β”œβ”€β”€ schemas/
β”‚   └── predefined.py               # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ configuration.py            # DomainTransformerConfig (24M/85M/330M presets)
β”‚   β”œβ”€β”€ modeling.py                 # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
β”‚   β”œβ”€β”€ plr_embeddings.py           # PeriodicLinearReLU (Gorishniy et al. 2022)
β”‚   └── joint_fusion.py             # DCNv2 + JointFusionModel (nuFormer-style)
└── training/
    β”œβ”€β”€ data_pipeline.py            # tokenize β†’ pack β†’ HFDataset
    └── pretrain.py                 # pretrain_domain_model (HF Trainer)
tests/
β”œβ”€β”€ test_tokenizer.py               # 72 tests
β”œβ”€β”€ test_model.py                   # 33 tests
└── test_training.py                # 19 tests
```
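
The three predefined schemas cover common cases, but `schema.py` is the extension point. A hedged sketch of a custom schema: the class names come from the layout above, while the exact `FieldSpec` signature and `FieldType` values are assumptions, so check the source before copying:

```python
from domain_tokenizer import DomainTokenizerBuilder
from domain_tokenizer.schema import DomainSchema, FieldSpec, FieldType

# Assumed constructor shapes; only the DomainSchema/FieldSpec/FieldType names
# are confirmed by the repo layout above. Verify against schema.py.
LOGISTICS_SCHEMA = DomainSchema(
    name="logistics",
    fields=[
        FieldSpec("carrier", FieldType.CATEGORICAL),
        FieldSpec("weight_kg", FieldType.MAGNITUDE_BUCKET),
        FieldSpec("shipped_at", FieldType.CALENDAR),
    ],
)

builder = DomainTokenizerBuilder(LOGISTICS_SCHEMA)
```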

## Key References

| Paper | Year | Role in domainTokenizer | Link |
|---|---|---|---|
| nuFormer (Nubank) | 2025 | Overall architecture blueprint | arXiv |
| NoPE | 2023 | No positional encoding β€” our attention design | arXiv |
| PLR Embeddings (Yandex) | 2022 | Numerical feature embeddings | arXiv |
| DCN V2 (Google) | 2021 | Tabular feature crossing in joint fusion | arXiv |
| RecFormer | 2023 | Items-as-text tokenization philosophy | arXiv |
| TIGER (Google) | 2023 | Semantic IDs via RQ-VAE | arXiv |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | arXiv |
| Banking TF | 2024 | Transaction tokenizer for French banking | arXiv |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | arXiv |
Full reference table (35+ papers): `docs/research_report.md`
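
Of these, PLR embeddings do the quiet work of feeding raw numbers into the model: Gorishniy et al. (2022) embed a scalar through learned periodic features, a linear layer, and a ReLU. A minimal PyTorch sketch of that recipe (not the library's `plr_embeddings.py`):

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic β†’ Linear β†’ ReLU embedding for one scalar feature
    (Gorishniy et al. 2022). Sketch only, not the library module."""

    def __init__(self, n_frequencies: int = 48, d_embedding: int = 64, sigma: float = 1.0):
        super().__init__()
        # Frequencies initialized from N(0, sigma^2), trained with the model.
        self.c = nn.Parameter(sigma * torch.randn(n_frequencies))
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch,)
        v = 2 * torch.pi * self.c * x[:, None]            # (batch, n_frequencies)
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)
        return torch.relu(self.linear(periodic))          # (batch, d_embedding)

emb = PLREmbedding()
print(emb(torch.tensor([79.99, 450.0])).shape)            # torch.Size([2, 64])
```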

## License

MIT
