MiniBananaMind-v3-9M
MiniBananaMind-v3-9M is a small causal language model trained from scratch on FineWeb-Edu and FineMath.
The model has about 8.9M parameters and uses a custom 8k-token byte-level BPE tokenizer with digit-aware tokenization.
This is a base language model, not an instruction-tuned chat assistant.
Model Details
| Field | Value |
|---|---|
| Parameters | 8,884,992 |
| Architecture | Custom Llama-style decoder |
| Layers | 9 |
| Hidden size | 256 |
| Intermediate size | 768 |
| Attention heads | 8 |
| KV heads | 2 |
| Vocabulary size | 8,192 |
| Context length | 1,024 |
| Embeddings | Tied input/output embeddings |
| Weight format | safetensors |
| Training precision | BF16 |
| Checkpoint used | latest mixed checkpoint |
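The parameter count in the table can be reproduced from the other fields. The sketch below assumes a standard Llama-style layout that the card does not spell out: no bias terms, grouped-query attention with a head size of 256 / 8 = 32, a three-matrix gated MLP, two RMSNorm weights per block, and one final norm. The function name `count_params` is illustrative only.

```python
# Assumed layout (not confirmed by the card): bias-free Llama-style blocks.
def count_params(vocab=8192, hidden=256, intermediate=768,
                 layers=9, heads=8, kv_heads=2, tied=True):
    head_dim = hidden // heads                    # 32
    embed = vocab * hidden                        # token embedding matrix
    attn = (hidden * hidden                       # q_proj
            + 2 * hidden * kv_heads * head_dim    # k_proj and v_proj (2 KV heads)
            + hidden * hidden)                    # o_proj
    mlp = 3 * hidden * intermediate               # gate, up and down projections
    norms = 2 * hidden                            # two norm weights per block
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tied:
        total += vocab * hidden                   # separate output head
    return total

print(count_params())  # 8884992
```

Under these assumptions the total matches the 8,884,992 listed above, with the tied embedding matrix accounting for 2,097,152 of those parameters.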
Tokenizer
MiniBananaMind-v3-9M uses a new digit-aware 8k tokenizer.
Each digit is its own token, so a multi-digit number is always split into its individual digits rather than merged into a single multi-digit token.
Digit IDs:
| Token | ID |
|---|---|
| 1 | 9 |
| 2 | 10 |
| 3 | 11 |
| 4 | 12 |
| 5 | 13 |
| 6 | 14 |
| 7 | 15 |
| 8 | 16 |
| 9 | 17 |
| 0 | 18 |
Examples:
18 -> [9, 16]
227 -> [10, 10, 15]
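The digit mapping is regular enough to write down directly. The following is a minimal sketch that reproduces the two examples from the ID table; it covers digit-only strings and is not the tokenizer itself, which also handles ordinary text. `DIGIT_IDS` and `digits_to_ids` are hypothetical names.

```python
# Digits 1-9 map to IDs 9-17, and 0 maps to 18 (from the table above).
DIGIT_IDS = {str(d): 8 + d for d in range(1, 10)}
DIGIT_IDS["0"] = 18

def digits_to_ids(number: str) -> list[int]:
    """Map a digit-only string to one token ID per digit."""
    return [DIGIT_IDS[ch] for ch in number]

print(digits_to_ids("18"))   # [9, 16]
print(digits_to_ids("227"))  # [10, 10, 15]
```

For real inputs, use the tokenizer shipped with the model; this helper raises `KeyError` on any non-digit character.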
Training Data
MiniBananaMind-v3-9M was trained on:
- HuggingFaceFW/fineweb-edu
- HuggingFaceTB/finemath
The training mix used both general educational web text and math-heavy text.
| Dataset | Tokens |
|---|---|
| FineWeb-Edu sample-10BT retokenized with digit tokenizer | 12,047,375,481 |
| FineMath finemath-4plus retokenized with digit tokenizer | 1,500,000,000 |
Training setup:
| Field | Value |
|---|---|
| Sequence length | 1,024 |
| FineMath sampling ratio | 30% |
| FineWeb sampling ratio | 70% |
| Batch size | 72 |
| Gradient accumulation | 16 |
| Tokens per optimizer step | 1,179,648 |
| Training steps | 11,471 |
| Approx training tokens seen | 13,531,742,208 |
| Learning rate | 5e-4 |
| Minimum learning rate | 5e-5 |
| Warmup steps | 500 |
| Weight decay | 0.1 |
| Hardware | NVIDIA RTX 5070 Ti |
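The token figures in the setup table follow from the batch settings, and the arithmetic is easy to check. The last two lines are a rough estimate only: they assume the 30% / 70% sampling ratios apply to tokens seen, which the card does not state explicitly.

```python
batch_size = 72
grad_accum = 16
seq_len = 1024
steps = 11_471

# Tokens consumed by one optimizer step and by the whole run.
tokens_per_step = batch_size * grad_accum * seq_len
tokens_seen = tokens_per_step * steps
print(tokens_per_step)  # 1179648
print(tokens_seen)      # 13531742208

# Assumption: the sampling ratios are shares of tokens seen.
finemath_passes = 0.30 * tokens_seen / 1_500_000_000
fineweb_passes = 0.70 * tokens_seen / 12_047_375_481
print(round(finemath_passes, 2), round(fineweb_passes, 2))
```

If that assumption holds, the FineMath subset was repeated between two and three times, while the FineWeb-Edu sample was seen less than once.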
Evaluation
Formal benchmark results for this checkpoint are not included yet.
Usage
This model uses custom architecture code, so load it with trust_remote_code=True.
Install dependencies:
```shell
pip install -U transformers safetensors torch
```
Run inference:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "BananaMind/MiniBananaMind-v3-9M"

# Use the GPU when available; otherwise fall back to CPU with float32.
if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    device = "cpu"
    dtype = torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device).eval()

prompt = "The color of the sky is "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Greedy decoding with a mild repetition penalty.
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=64,
        do_sample=False,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
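Because the context length is 1,024 tokens, the prompt and the generated tokens together should stay within that limit. One possible way to enforce this is to keep only the most recent prompt tokens, sketched below on a plain list of token IDs. The helper `fit_prompt` is hypothetical and not part of the model's code.

```python
CONTEXT_LENGTH = 1024  # from the Model Details table

def fit_prompt(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the last tokens of the prompt so prompt + generation fits the context."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    return token_ids[-budget:]

long_prompt = list(range(2000))          # stand-in for real token IDs
kept = fit_prompt(long_prompt, max_new_tokens=64)
print(len(kept))  # 960
```

Truncating from the left keeps the text closest to the continuation point, which is usually what a base model needs most.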
Suggested Generation Settings
For stable continuations:
- do_sample=False
- repetition_penalty=1.1
- max_new_tokens=64 to 128
For more varied text:
- do_sample=True
- temperature=0.6
- top_p=0.9
- repetition_penalty=1.1
- max_new_tokens=64 to 128
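The two presets can be kept as keyword dictionaries and unpacked into `model.generate`. This is a small sketch; the names `STABLE`, `VARIED`, and `generation_kwargs` are illustrative, and 64 is used as the default length from the suggested 64 to 128 range.

```python
# Suggested settings from this section, expressed as generate() keyword arguments.
STABLE = {"do_sample": False, "repetition_penalty": 1.1}
VARIED = {"do_sample": True, "temperature": 0.6, "top_p": 0.9,
          "repetition_penalty": 1.1}

def generation_kwargs(varied=False, max_new_tokens=64):
    """Return a fresh dict of settings for stable or more varied continuations."""
    kwargs = dict(VARIED if varied else STABLE)
    kwargs["max_new_tokens"] = max_new_tokens
    return kwargs

print(generation_kwargs(varied=True, max_new_tokens=128))

# Usage with a loaded model and tokenized prompt (not run here):
# output = model.generate(input_ids=input_ids, **generation_kwargs(varied=True))
```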
License
Apache 2.0
