# ezellm-lite-tokenizer
A 24,600-vocab byte-level BPE tokenizer trained on a 142 GB code-heavy corpus, with a Qwen2.5-Coder-style special-token layout (FIM + repo metadata + reserved slots). Designed for small-to-mid scale code language models where embedding-table size matters.
This is v2 of the tokenizer.
## Quick start

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")
ids = tok.encode("def hello(name):\n    print(f'Hello, {name}!')\n")
print(len(ids), tok.decode(ids))
```
## Vocabulary layout

Total vocab: 24,600. The first 24,576 IDs are learned BPE merges; the final 24 IDs are reserved for control tokens.

| ID range | Tokens | Purpose |
|---|---|---|
| 0 – 24,575 | Learned BPE pieces | Text/code |
| 24,576 | `<\|endoftext\|>` | EOS + PAD |
| 24,577 – 24,580 | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>`, `<\|fim_pad\|>` | Fill-in-the-Middle training |
| 24,581 – 24,583 | `<\|file_sep\|>`, `<\|repo_name\|>`, `<\|filename\|>` | Repo-level packing |
| 24,584 – 24,599 | `<\|reserved_0\|>` … `<\|reserved_15\|>` | Reserved for downstream use |
Only `<|endoftext|>` is registered as `eos_token` / `pad_token` in `special_tokens_map`. The FIM and repo markers are added tokens but are not flagged "special", so `tok.decode(ids, skip_special_tokens=True)` will only strip `<|endoftext|>`. Register the others explicitly if you want them stripped on decode.
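One way to do that with the stock `transformers` API (the token strings are exactly those in the table above):

```python
# Promote the FIM / repo markers to "special" so that
# skip_special_tokens=True also strips them on decode.
tok.add_special_tokens({
    "additional_special_tokens": [
        "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>", "<|fim_pad|>",
        "<|file_sep|>", "<|repo_name|>", "<|filename|>",
    ]
})
```

Since these strings already exist as added tokens, this should only flag them as special rather than grow the vocabulary.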
## Training data

- Size: ~142 GB of text and source code
- Mix (heavily code-leaning): Python, JavaScript/TypeScript, Java, C/C++, web (HTML/CSS), Markdown, math/scientific Python, technical prose
- Algorithm: byte-level BPE (tiktoken-compatible; `tiktoken.bpe` and `tiktoken.json` are bundled alongside the standard `tokenizer.json`)
The tokenizer files in this repo can be loaded both via 🤗 transformers (`tokenizer.json`) and via tiktoken directly (`tiktoken.bpe`).
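A minimal sketch of the tiktoken path, assuming `tiktoken.json` carries `pattern` and `special_tokens` keys (per the Files table below); the key names are an assumption, so check the file's actual schema:

```python
import base64
import json

import tiktoken

# tiktoken.bpe uses the standard "<base64 token> <rank>" line format,
# the same layout tiktoken.load.load_tiktoken_bpe parses.
with open("tiktoken.bpe", "rb") as f:
    ranks = {
        base64.b64decode(tok64): int(rank)
        for tok64, rank in (line.split() for line in f.read().splitlines() if line)
    }

with open("tiktoken.json") as f:
    meta = json.load(f)

enc = tiktoken.Encoding(
    name="ezellm-lite",
    pat_str=meta["pattern"],                # pre-tokenization regex
    mergeable_ranks=ranks,
    special_tokens=meta["special_tokens"],  # e.g. {"<|endoftext|>": 24576, ...}
)
print(enc.encode("def hello(): pass"))
```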
## Benchmarks

Compared against size-matched, code-trained tokenizers in the 24K–50K vocab range. Measured on a held-out multi-domain corpus (~618 KB across 10 categories: C, C++, Java, JavaScript, Markdown, math/Python, prose, general Python, HTML/CSS, raw outputs).
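For reference, a minimal sketch of how such a chars/token measurement can be reproduced; the corpus directory and file layout here are illustrative, not the actual benchmark harness:

```python
from pathlib import Path

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

total_chars = total_tokens = 0
for shard in Path("bench_corpus").glob("**/*.txt"):  # hypothetical corpus dir
    text = shard.read_text(encoding="utf-8")
    total_chars += len(text)
    total_tokens += len(tok.encode(text))

print(f"chars/token:       {total_chars / total_tokens:.3f}")
print(f"tokens / 1k chars: {1000 * total_tokens / total_chars:.1f}")
```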
### Aggregate compression
Higher chars/token = better compression = shorter context for the same input.
| Tokenizer | Vocab | chars/token | tokens / 1k chars |
|---|---|---|---|
| StarCoder2 | 49,152 | 3.238 | 308.9 |
| ezellm-lite | 24,600 | 3.081 | 324.6 |
| CodeGen-mono | 50,295 | 3.017 | 331.5 |
| DeepSeek-Coder | 32,022 | 2.836 | 352.6 |
| GPT-2 | 50,257 | 2.487 | 402.1 |
ezellm-lite is the second-best tokenizer in this group: within ~5% of StarCoder2 despite having half the vocabulary, ahead of CodeGen-mono and DeepSeek-Coder (both of which have 30–100% more vocab slots), and ~24% more compressive than GPT-2.
### Compression by category: characters per token
| Category | ezellm-lite (24.6K) | StarCoder2 (49K) | CodeGen-mono (50K) | DeepSeek-Coder (32K) | GPT-2 (50K) |
|---|---|---|---|---|---|
| c | 2.839 | 2.900 | 2.630 | 2.534 | 2.470 |
| cpp | 3.157 | 3.303 | 2.914 | 2.793 | 2.289 |
| java | 3.996 | 4.517 | 3.606 | 3.605 | 2.329 |
| javascript | 3.142 | 3.423 | 2.988 | 2.898 | 2.357 |
| markdown_docs | 3.117 | 3.272 | 3.125 | 2.906 | 2.986 |
| math_python | 2.630 | 2.695 | 2.482 | 2.387 | 2.136 |
| prose | 3.661 | 3.731 | 4.356 | 3.855 | 4.356 |
| python_general | 3.680 | 3.747 | 3.249 | 3.169 | 2.586 |
| web_html_css | 2.673 | 2.897 | 2.831 | 2.543 | 2.289 |
**Reading the table.** ezellm-lite either wins or comes within a few percent of the leader on every code category, beating CodeGen-mono and DeepSeek-Coder consistently on Python, JavaScript, Java, and C++. StarCoder2 edges it out on most categories (expected: 2× the vocab, trained on The Stack v2). The one place ezellm-lite clearly trails is prose, where the prose-heavy GPT-2 vocabulary still wins; that is a deliberate trade for a code-focused tokenizer.
### Efficiency per vocabulary slot

A 24K-vocab tokenizer is at a structural disadvantage on raw chars/token. A fairer cross-vocab metric is chars/token × log₂(vocab), roughly "input bits carried per token."
| Tokenizer | Vocab | chars/tok | chars/tok × log₂(V) |
|---|---|---|---|
| StarCoder2 | 49,152 | 3.238 | 50.46 |
| CodeGen-mono | 50,295 | 3.017 | 47.12 |
| ezellm-lite | 24,600 | 3.081 | 44.94 |
| DeepSeek-Coder | 32,022 | 2.836 | 42.45 |
| GPT-2 | 50,257 | 2.487 | 38.84 |
Even on this slot-adjusted metric ezellm-lite stays within ~5% of CodeGen-mono, which it beats outright on raw compression while using less than half the embedding parameters, and it sits clearly above DeepSeek-Coder and GPT-2 on both counts.
At `d_model=1024`, the embedding/output table sizes are: ezellm-lite ~25M, DeepSeek-Coder ~33M, and StarCoder2 / CodeGen-mono / GPT-2 ~50M parameters each. For a small code LM (≤2B params), those tens of millions of embedding and softmax parameters are a real share of the budget.
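Both the slot-adjusted metric and the embedding sizes are straightforward arithmetic on the published vocab sizes and chars/token figures:

```python
import math

# (vocab size, chars/token) from the aggregate-compression table above
tokenizers = {
    "StarCoder2":     (49_152, 3.238),
    "CodeGen-mono":   (50_295, 3.017),
    "ezellm-lite":    (24_600, 3.081),
    "DeepSeek-Coder": (32_022, 2.836),
    "GPT-2":          (50_257, 2.487),
}

d_model = 1024
for name, (vocab, cpt) in tokenizers.items():
    bits_per_tok = cpt * math.log2(vocab)   # chars/token x log2(V)
    embed_params = vocab * d_model          # one embedding (or output) table
    print(f"{name:15s} {bits_per_tok:6.2f}  {embed_params / 1e6:5.1f}M params")
```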
### Encoding speed

All five tokenizers are sub-millisecond on the 60–90 KB sample shards in this benchmark; encoding speed is not the bottleneck for any of them.
## Files

| File | Purpose |
|---|---|
| `tokenizer.json` | 🤗 Tokenizers / transformers-loadable model |
| `tokenizer_config.json` | Special-token metadata for transformers |
| `tiktoken.bpe` | tiktoken-format merge table |
| `tiktoken.json` | tiktoken metadata (pattern, special tokens) |
## Intended use

- Pretraining / fine-tuning small-to-mid code LMs (≤2B params) where vocabulary size dominates parameter count
- FIM-style training out of the box (FIM specials are pre-allocated)
- Repo-aware packing using `<|repo_name|>`, `<|filename|>`, `<|file_sep|>` (see the sketch below)
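A minimal sketch of what both can look like. The exact convention (ordering, newlines, whether `<|filename|>` is used alongside `<|file_sep|>`) is set by your training pipeline; this just mirrors the Qwen-style PSM layout the token set is modeled on, with made-up repo/file names:

```python
# FIM (PSM order): the model is trained to emit the middle after
# seeing both the prefix and the suffix.
prefix = "def add(a, b):\n    "
suffix = "\n"
middle = "return a + b"

fim_sample = (
    f"<|fim_prefix|>{prefix}"
    f"<|fim_suffix|>{suffix}"
    f"<|fim_middle|>{middle}<|endoftext|>"
)

# Repo-level packing: repo header, then files joined by <|file_sep|>.
repo_sample = (
    "<|repo_name|>acme/utils\n"
    "<|file_sep|>math_utils.py\n"
    "def add(a, b):\n    return a + b\n"
    "<|endoftext|>"
)

ids = tok.encode(fim_sample)  # each marker resolves to a single ID (24,577+)
```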
## Limitations

- Not optimized for non-English natural language; the corpus is English + code.
- Compression on punctuation-dense languages (C, HTML/CSS) is noticeably lower than on Python or prose (more tokens per character); budget context length accordingly.
- The 16 reserved slots are unused: they never appear in trained text and must be registered explicitly if you want to repurpose them (e.g. as chat-template tags).
## License
Apache-2.0.