ezellm-lite-tokenizer

A 24,600-vocab byte-level BPE tokenizer trained on a 142 GB code-heavy corpus, with a Qwen2.5-Coder-style special-token layout (FIM + repo metadata + reserved slots). Designed for small-to-mid scale code language models where embedding-table size matters.

This is v2 of the tokenizer.

Quick start

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TerminatorPower/ezellm-lite-tokenizer")

ids = tok.encode("def hello(name):\n    print(f'Hello, {name}!')\n")
print(len(ids), tok.decode(ids))

Vocabulary layout

Total vocab: 24,600. The first 24,576 IDs are learned BPE merges; the top 24 IDs are reserved for control tokens.

ID range Tokens Purpose
0 – 24,575 Learned BPE pieces Text/code
24,576 <|endoftext|> EOS + PAD
24,577 – 24,580 <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>, <|fim_pad|> Fill-in-the-Middle training
24,581 – 24,583 <|file_sep|>, <|repo_name|>, <|filename|> Repo-level packing
24,584 – 24,599 <|reserved_0|> … <|reserved_15|> Reserved for downstream use

Only <|endoftext|> is registered as eos_token / pad_token in special_tokens_map. The FIM and repo markers are added tokens but are not flagged "special": tok.decode(ids, skip_special_tokens=True) will only strip <|endoftext|>. Register the others explicitly if you want them stripped on decode.
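A minimal sketch of that registration, continuing from the quick start and using the standard transformers add_special_tokens call:

# Promote the already-present markers to "special" so skip_special_tokens strips them.
markers = ["<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>", "<|fim_pad|>",
           "<|file_sep|>", "<|repo_name|>", "<|filename|>"]
tok.add_special_tokens({"additional_special_tokens": markers})

# The strings already exist in the vocab, so their IDs should stay 24,577-24,583.
print(tok.convert_tokens_to_ids(markers))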

Training data

  • Size: ~142 GB of text and source code
  • Mix (heavily code-leaning): Python, JavaScript/TypeScript, Java, C/C++, web (HTML/CSS), Markdown, math/scientific Python, technical prose
  • Algorithm: Byte-level BPE (tiktoken-compatible; tiktoken.bpe and tiktoken.json are bundled alongside the standard tokenizer.json)

The tokenizer files in this repo can be loaded both via 🤗 transformers (tokenizer.json) and via tiktoken directly (tiktoken.bpe).
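A rough sketch of the tiktoken path. The key names read from tiktoken.json below (pat_str, special_tokens) are assumptions about its layout, not a documented schema:

import json
import tiktoken
from tiktoken.load import load_tiktoken_bpe

ranks = load_tiktoken_bpe("tiktoken.bpe")      # merge table shipped in this repo
meta = json.load(open("tiktoken.json"))        # pattern + special-token metadata

enc = tiktoken.Encoding(
    name="ezellm-lite",
    pat_str=meta["pat_str"],                   # assumed key name
    mergeable_ranks=ranks,
    special_tokens=meta["special_tokens"],     # assumed key name, e.g. {"<|endoftext|>": 24576, ...}
)
print(enc.encode("def hello(): pass"))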

Benchmarks

Compared against size-matched, code-trained tokenizers in the 24K–50K vocab range. Measured on a held-out multi-domain corpus (~618 KB across 10 categories: C, C++, Java, JavaScript, Markdown, math/Python, prose, general Python, HTML/CSS, raw outputs).

Aggregate compression

Higher chars/token = better compression = shorter context for the same input.
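Both columns come from the same measurement; with the tok loaded in the quick start and any held-out file (sample.py below is a placeholder), the per-sample numbers are simply:

text = open("sample.py").read()                 # placeholder: any held-out file
ids = tok.encode(text)

chars_per_token = len(text) / len(ids)
tokens_per_1k_chars = 1000 / chars_per_token
print(f"{chars_per_token:.3f} chars/token, {tokens_per_1k_chars:.1f} tokens / 1k chars")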

Tokenizer Vocab chars/token tokens / 1k chars
StarCoder2 49,152 3.238 308.9
ezellm-lite 24,600 3.081 324.6
CodeGen-mono 50,295 3.017 331.5
DeepSeek-Coder 32,022 2.836 352.6
GPT-2 50,257 2.487 402.1

ezellm-lite is the second-best tokenizer in this group: within ~5% of StarCoder2 despite having half the vocabulary, ahead of CodeGen-mono and DeepSeek-Coder (which carry roughly 2× and 1.3× as many vocab slots), and ~24% more compressive than GPT-2.

Compression by category (characters per token)

Category ezellm-lite (24.6K) StarCoder2 (49K) CodeGen-mono (50K) DeepSeek-Coder (32K) GPT-2 (50K)
c 2.839 2.900 2.630 2.534 2.470
cpp 3.157 3.303 2.914 2.793 2.289
java 3.996 4.517 3.606 3.605 2.329
javascript 3.142 3.423 2.988 2.898 2.357
markdown_docs 3.117 3.272 3.125 2.906 2.986
math_python 2.630 2.695 2.482 2.387 2.136
prose 3.661 3.731 4.356 3.855 4.356
python_general 3.680 3.747 3.249 3.169 2.586
web_html_css 2.673 2.897 2.831 2.543 2.289

Reading the table. ezellm-lite comes second to StarCoder2 on most code categories and consistently beats CodeGen-mono and DeepSeek-Coder on Python, JavaScript, Java, and C++. StarCoder2 edges it out almost everywhere (expected: 2× the vocab, trained on The Stack v2). The one place ezellm-lite clearly trails is prose, where the prose-heavy GPT-2 and CodeGen-mono vocabularies still lead; that is a deliberate trade for a code-focused tokenizer.

Efficiency per vocabulary slot

A 24K-vocab tokenizer is at a structural disadvantage on raw chars/token. A fairer cross-vocab metric is chars/token × log₂(vocab), roughly "input bits carried per token."

Tokenizer Vocab chars/tok chars/tok × log₂(V)
StarCoder2 49,152 3.238 50.46
CodeGen-mono 50,295 3.017 47.12
ezellm-lite 24,600 3.081 44.94
DeepSeek-Coder 32,022 2.836 42.45
GPT-2 50,257 2.487 38.84

By this measure ezellm-lite trails only StarCoder2 and CodeGen-mono while using less than half their embedding parameters, and sits clearly above DeepSeek-Coder and GPT-2; on raw chars/token it actually edges out CodeGen-mono.

At d_model=1024, the embedding/output table sizes are: ezellm-lite ~25M, DeepSeek-Coder ~33M, StarCoder2 / CodeGen-mono / GPT-2 ~50M. For a small code LM (≤2B params), those tens of millions of softmax parameters are a meaningful slice of the budget.
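A quick sanity check on those figures, using only the vocab sizes from the tables above; d_model=1024 is the stated working assumption, not a property of any of these models:

import math

d_model = 1024
vocabs = {"ezellm-lite": 24_600, "DeepSeek-Coder": 32_022,
          "StarCoder2": 49_152, "CodeGen-mono": 50_295, "GPT-2": 50_257}

for name, v in vocabs.items():
    embed_params = v * d_model                 # one embedding (or output) table
    print(f"{name}: {embed_params / 1e6:.1f}M params per table, log2(V) = {math.log2(v):.2f}")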

Encoding speed

All five tokenizers are sub-millisecond on the 60–90 KB sample shards in this benchmark; encoding speed is not the bottleneck for any of them.
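If you want to reproduce that measurement, a minimal timing sketch (the shard path is a placeholder) looks like:

import time

text = open("shard.js").read()                 # placeholder: any 60-90 KB sample shard
t0 = time.perf_counter()
ids = tok.encode(text)
elapsed_ms = (time.perf_counter() - t0) * 1000
print(f"{len(ids)} tokens in {elapsed_ms:.3f} ms")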

Files

File Purpose
tokenizer.json 🤗 Tokenizers / transformers-loadable model
tokenizer_config.json Special-token metadata for transformers
tiktoken.bpe tiktoken-format merge table
tiktoken.json tiktoken metadata (pattern, special tokens)

Intended use

  • Pretraining / fine-tuning small-to-mid code LMs (≤2B params) where vocabulary size dominates parameter count
  • FIM-style training out of the box (FIM specials are pre-allocated)
  • Repo-aware packing using <|repo_name|>, <|filename|>, <|file_sep|>
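A sketch of how those markers can be laid out. The prefix-suffix-middle ordering below follows the common Qwen/StarCoder-style FIM convention, and the repo/file values are placeholders, not a prescribed format:

# Fill-in-the-Middle: prefix-suffix-middle (PSM) layout
prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
middle = "result = a + b"
fim_text = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}<|endoftext|>"

# Repo-level packing (one plausible layout using the repo markers)
doc = (
    "<|repo_name|>example/repo\n"
    "<|filename|>src/main.py\n"
    "print('hello')\n"
    "<|file_sep|>"
    "<|filename|>README.md\n"
    "# example\n"
    "<|endoftext|>"
)

ids = tok.encode(fim_text)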

Limitations

  • Not optimized for non-English natural language; the corpus is English + code.
  • Compression on punctuation-dense code (C, HTML/CSS) is noticeably weaker than for Python and prose; budget context length accordingly.
  • The 16 reserved slots are unused: they never occur in the training corpus, and they need to be registered explicitly if you want to repurpose them (e.g. as chat-template tags); see the sketch below.
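One hedged example of such repurposing, treating <|reserved_0|> / <|reserved_1|> as hypothetical chat-turn delimiters:

tok.add_special_tokens({"additional_special_tokens": ["<|reserved_0|>", "<|reserved_1|>"]})

# Hypothetical repurposing: use the reserved strings as turn delimiters in your own template.
prompt = "<|reserved_0|>user\nWrite a haiku about BPE.<|reserved_1|>"
print(tok.encode(prompt))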

License

Apache-2.0.
