---
pipeline_tag: text-generation
library_name: transformers
language:
- en
tags:
- tokenizer
- bpe
- byte-level
- chatml
- tool-use
- code
- python
- conversational
---
# Daisy Tokenizer

A custom byte-level BPE tokenizer trained for the Daisy language model, optimized for Python code and instruction-following tasks.
## Details
| Property | Value |
|---|---|
| Vocabulary size | 49,152 |
| Algorithm | Byte-level BPE |
| Pre-tokenizer | Llama-3 style regex |
| Chat format | ChatML |
| Max length | 131,072 tokens |
| Training date | 2026-01-14 |
## Features

- **Python-optimized**: Trained on Python code for efficient tokenization
- **Tool calling**: Native support for `<|tool_call|>`/`<|tool_result|>` patterns
- **Inline computation**: Support for `<|python|>`/`<|output|>` for calculator-style reasoning
- **Chain-of-thought**: `<|think|>` tokens for reasoning blocks
- **No UNK tokens**: Byte-level fallback handles any Unicode input (see the sketch below)
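To sanity-check the no-UNK claim, a minimal sketch (assuming the repo id from the Usage section; the sample string is arbitrary):

```python
# Minimal sketch: with byte-level fallback, arbitrary Unicode round-trips
# without any <unk> token. Repo id assumed from the Usage section below.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JonathanMiddleton/daisy")

text = "naïve café → 🌼"
ids = tokenizer.encode(text, add_special_tokens=False)

# A byte-level BPE tokenizer defines no UNK token, so nothing is lost.
assert tokenizer.unk_token_id is None or tokenizer.unk_token_id not in ids
assert tokenizer.decode(ids) == text
```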
## Special Tokens

| Token | ID | Purpose |
|---|---|---|
| `<\|endoftext\|>` | 49131 | End of sequence / BOS |
| `<\|pad\|>` | 49132 | Padding token |
| `<\|im_start\|>` | 49133 | Start of message (ChatML) |
| `<\|im_end\|>` | 49134 | End of message (ChatML) |
| `<\|tool_call\|>` | 49135 | Start of tool call |
| `<\|/tool_call\|>` | 49136 | End of tool call |
| `<\|tool_result\|>` | 49137 | Start of tool result |
| `<\|/tool_result\|>` | 49138 | End of tool result |
| `<\|python\|>` | 49139 | Start of Python expression |
| `<\|/python\|>` | 49140 | End of Python expression |
| `<\|output\|>` | 49141 | Start of computed output |
| `<\|/output\|>` | 49142 | End of computed output |
| `<\|think\|>` | 49143 | Start of reasoning block |
| `<\|/think\|>` | 49144 | End of reasoning block |
| `<\|system\|>` | 49145 | System role marker |
| `<\|user\|>` | 49146 | User role marker |
| `<\|assistant\|>` | 49147 | Assistant role marker |
| `<\|reserved_0\|>` | 49148 | Reserved |
| `<\|reserved_1\|>` | 49149 | Reserved |
| `<\|reserved_2\|>` | 49150 | Reserved |
| `<\|reserved_3\|>` | 49151 | Reserved |
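A quick sketch for verifying the IDs in the table above against the shipped tokenizer (repo id assumed from the Usage section):

```python
# Look up a few special-token IDs and compare with the table above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JonathanMiddleton/daisy")

for token in ["<|endoftext|>", "<|im_start|>", "<|tool_call|>", "<|think|>"]:
    print(token, "->", tokenizer.convert_tokens_to_ids(token))
# Expected per the table: 49131, 49133, 49135, 49143
```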
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("JonathanMiddleton/daisy")

# Basic encoding
tokens = tokenizer.encode("Hello, world!")

# Chat formatting
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
```
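For generation-time prompting, `apply_chat_template` can also append the assistant header so the model continues from it; a short follow-up to the snippet above, assuming the bundled ChatML template follows the usual convention:

```python
# Continue from the snippet above: append the assistant header so the
# model generates the next assistant turn.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# With a ChatML template, prompt should now end with "<|im_start|>assistant\n"
```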
## Chat Template Format

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_message}<|im_end|>
```
## Tool Calling Example

```
<|im_start|>assistant
Let me calculate that for you.
<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>
<|tool_result|>4<|/tool_result|>
The answer is 4.<|im_end|>
```
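A minimal sketch of how a runtime might extract the JSON payload between the tool-call markers; the regex and helper below are illustrative, not part of the tokenizer itself:

```python
import json
import re

# Illustrative parser for the tool-call span shown above. Note the
# escaping: "|" is a regex metacharacter.
TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|/tool_call\|>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return the parsed JSON payload of each tool call in `text`."""
    return [json.loads(m.group(1)) for m in TOOL_CALL_RE.finditer(text)]

reply = (
    "Let me calculate that for you.\n"
    '<|tool_call|>{"name": "calculator", "arguments": {"expression": "2 + 2"}}<|/tool_call|>'
)
print(extract_tool_calls(reply))
# [{'name': 'calculator', 'arguments': {'expression': '2 + 2'}}]
```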
## Compression Ratios

Benchmarked against common tokenizers on Python code, prose, and instruction data:

### Python Code (SmolTalk self-oss-instruct, 504 samples)
| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.391 | 88,644 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.366 | 89,139 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 3.906 | 99,650 |
| JonathanMiddleton/daisy | 49,131 | 3.766 | 103,349 |
| microsoft/phi-2 | 50,257 | 3.628 | 107,290 |
| openai-community/gpt2 | 50,257 | 3.152 | 123,467 |
### English Prose (FineWeb-Edu, 505 samples)
| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.681 | 466,617 |
| JonathanMiddleton/daisy | 49,131 | 4.594 | 475,422 |
| openai-community/gpt2 | 50,257 | 4.584 | 476,460 |
| microsoft/phi-2 | 50,257 | 4.584 | 476,461 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.563 | 478,607 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 4.475 | 488,120 |
### Instructions (SmolTalk, 504 samples)
| Tokenizer | Vocab Size | Chars/Token | Tokens |
|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 128,000 | 4.771 | 737,130 |
| Qwen/Qwen2.5-1.5B-Instruct | 151,643 | 4.731 | 743,360 |
| JonathanMiddleton/daisy | 49,131 | 4.487 | 783,803 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 49,152 | 4.455 | 789,399 |
| microsoft/phi-2 | 50,257 | 4.437 | 792,658 |
| openai-community/gpt2 | 50,257 | 4.254 | 826,711 |
### Cross-Content Average
| Tokenizer | Python | Prose | Instruction | Average |
|---|---|---|---|---|
| meta-llama/Llama-3.2-3B-Instruct | 4.391 | 4.681 | 4.771 | 4.614 |
| Qwen/Qwen2.5-1.5B-Instruct | 4.366 | 4.563 | 4.731 | 4.554 |
| JonathanMiddleton/daisy | 3.766 | 4.594 | 4.487 | 4.282 |
| HuggingFaceTB/SmolLM2-135M-Instruct | 3.906 | 4.475 | 4.455 | 4.278 |
| microsoft/phi-2 | 3.628 | 4.584 | 4.437 | 4.216 |
| openai-community/gpt2 | 3.152 | 4.584 | 4.254 | 3.997 |
**Key findings:** Daisy achieves competitive compression with a ~49K vocabulary: it leads the similar-sized (~50K-vocabulary) tokenizers on prose and instructions, places second among them on Python code, and trails only the far larger Llama-3.2 (128K) and Qwen2.5 (152K) vocabularies on the cross-content average.
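The chars/token figures above follow the usual definition (total characters divided by total tokens, special tokens excluded); a sketch of the comparison harness, with the sample list left as a placeholder for the benchmark corpora:

```python
# Chars/token = total characters / total tokens, specials excluded.
# `samples` is a placeholder for the benchmark texts (e.g. the 504
# SmolTalk self-oss-instruct samples).
from transformers import AutoTokenizer

samples = ["def add(a, b):\n    return a + b\n"]  # placeholder corpus

for name in ["JonathanMiddleton/daisy", "openai-community/gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    chars = sum(len(s) for s in samples)
    tokens = sum(len(tok.encode(s, add_special_tokens=False)) for s in samples)
    print(f"{name}: {chars / tokens:.3f} chars/token")
```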
## Training Data
- General text: lehduong/nemotron-cc-hq (~60%)
- Python code: HuggingFaceTB/smoltalk, self-oss-instruct (~25%)
- Instructions: HuggingFaceTB/OpenHermes-2.5-H4, OpenHermes (~15%)
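A mixture like the one above could be drawn by weighted sampling; a sketch using `datasets.interleave_datasets`, where the config/split names and column compatibility are assumptions, not the published training recipe:

```python
# Hypothetical sketch of the ~60/25/15 mixture. Config and split names
# are assumptions; the exact recipe is not published in this card.
from datasets import interleave_datasets, load_dataset

general = load_dataset("lehduong/nemotron-cc-hq", split="train", streaming=True)
code = load_dataset(
    "HuggingFaceTB/smoltalk", "self-oss-instruct", split="train", streaming=True
)
instruct = load_dataset(
    "HuggingFaceTB/OpenHermes-2.5-H4", split="train", streaming=True
)

# Weighted interleave; assumes the sources expose compatible text columns.
mixture = interleave_datasets(
    [general, code, instruct],
    probabilities=[0.60, 0.25, 0.15],
    seed=0,
)
```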
## License
Apache 2.0