Chisel-7B-v0.1

Chisel-7B is a vocabulary-expanded variant of Mistral-7B-v0.1 with targeted Urdu script support. The model is a pre-finetuning checkpoint: it does not yet understand Urdu semantics, but it tokenizes Urdu text with 36.3% lower fertility (tokens per word) than the base model, making it a stronger foundation for Urdu continued pretraining or instruction tuning.



Results

Fertility vs. subwords added

N       Vocab     Tokens/Word     Reduction
──────────────────────────────────────────
0       32,000       4.59            —
30      32,030       4.55          0.9%
50      32,077       3.69         19.5%
100     32,125       3.47         24.5%
200     32,214       3.22         29.9%
300     32,295       3.09         32.6%
400     32,377       3.00         34.7%
500     32,467       2.92         36.3%   ← recommended
700     32,648       2.82         38.6%
1000    32,924       2.70         41.2%
1500    33,372       2.57         44.1%
2000    33,593       2.52         45.1%   ← maximum

Result

N=500 achieves 80.5% of the largest fertility reduction measured (36.3% vs. 45.1% at N=2000) with only a 1.46% vocabulary increase (467 new tokens on top of 32,000). Beyond N=500, the marginal gain falls below 1.5 percentage points per 100 additional subwords.
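The figures in this paragraph can be re-derived from the table. The sketch below copies four rows from the table and recomputes the share of the N=2000 reduction reached at N=500, the vocabulary increase, and the marginal gain per 100 subwords. The helper names are illustrative and are not part of any released code.

```python
# Values copied from the "Fertility vs. subwords added" table.
BASE_VOCAB = 32_000

# N -> (vocab size, tokens/word, reported reduction in %)
ROWS = {
    400: (32_377, 3.00, 34.7),
    500: (32_467, 2.92, 36.3),
    700: (32_648, 2.82, 38.6),
    2000: (33_593, 2.52, 45.1),
}


def recovered_share(n, n_max=2000):
    """Share of the largest measured reduction that N already achieves."""
    return ROWS[n][2] / ROWS[n_max][2]


def vocab_increase(n):
    """Relative growth of the vocabulary over the 32,000-token base."""
    return (ROWS[n][0] - BASE_VOCAB) / BASE_VOCAB


def marginal_gain_per_100(n_lo, n_hi):
    """Reduction gained per 100 extra subwords, in percentage points."""
    return (ROWS[n_hi][2] - ROWS[n_lo][2]) / (n_hi - n_lo) * 100


print(f"{recovered_share(500):.1%}")              # 80.5%
print(f"{vocab_increase(500):.2%}")               # 1.46%
print(f"{marginal_gain_per_100(400, 500):.2f}")   # 1.60
print(f"{marginal_gain_per_100(500, 700):.2f}")   # 1.15
```

The last two lines show where the 1.5-point threshold is crossed: the step from 400 to 500 is still above it, and the step from 500 to 700 is below it.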

Per-sentence tokenization

Sentence                            Words   Base tokens   Chisel tokens   Reduction
───────────────────────────────────────────────────────────────────────────────────
پاکستان ایک خوبصورت ملک ہے          5       26            13              50.0%
اردو زبان بہت میٹھی ہے              5       22            14              36.4%
کراچی پاکستان کا سب سے بڑا شہر ہے   8       35            20              42.9%
علم حاصل کرنا ہر مسلمان پر فرض ہے   8       33            18              45.5%
آج موسم بہت اچھا ہے                  5       19            12              36.8%
Average                             6.2     27.0          15.4            43.0%
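For reference, fertility here means tokens per whitespace-separated word. The sketch below shows one way to compute it for any tokenizer, then pools the counts from the table to get the overall reduction. The `tokenize` argument is a hypothetical callable that returns a list of tokens (for example, a wrapper around `tokenizer.encode`); the counts are copied from the table rather than recomputed.

```python
def fertility(tokenize, sentences):
    """Average number of tokens per whitespace-separated word."""
    words = sum(len(s.split()) for s in sentences)
    tokens = sum(len(tokenize(s)) for s in sentences)
    return tokens / words


def reduction(base_tokens, new_tokens):
    """Relative drop in token count when moving from base to new tokenizer."""
    return 1 - new_tokens / base_tokens


# (words, base tokens, Chisel tokens) copied from the per-sentence table.
COUNTS = [(5, 26, 13), (5, 22, 14), (8, 35, 20), (8, 33, 18), (5, 19, 12)]

words = sum(w for w, _, _ in COUNTS)
base = sum(b for _, b, _ in COUNTS)
chisel = sum(c for _, _, c in COUNTS)

print(f"base fertility:   {base / words:.2f}")              # 4.35
print(f"chisel fertility: {chisel / words:.2f}")            # 2.48
print(f"pooled reduction: {reduction(base, chisel):.1%}")   # 43.0%
```

These five sentences are short and simple, so their fertility values differ from the corpus-level numbers in the first table.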

English perplexity (Wikitext-2)

Model                      Perplexity
─────────────────────────────────────
Mistral-7B-v0.1 (4-bit)    9.04
Chisel-7B-v0.1 (4-bit)     10.82
Delta                      +1.78

The +1.78 perplexity increase is attributed to softmax redistribution over the expanded output vocabulary: the new, not-yet-trained tokens take probability mass away from the original ones. This degradation is expected to recover after continued pretraining on Urdu data, consistent with prior vocabulary expansion work (Hewitt 2021; Cui et al. 2023).
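The redistribution effect can be illustrated with a toy softmax. The logits below are made-up numbers, not measurements from either model; the sketch only shows that appending extra logits lowers the probability of the correct token and therefore raises perplexity, defined here as the exponential of the mean negative log-likelihood.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def perplexity(target_probs):
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))


# Toy setup (assumed values): 8 original tokens, the correct one is index 0.
old_logits = [4.0] + [0.0] * 7
# Four new tokens whose untrained logits are not strongly negative.
new_logits = old_logits + [1.0] * 4

p_old = softmax(old_logits)[0]
p_new = softmax(new_logits)[0]

print(f"P(target) before expansion: {p_old:.3f}")
print(f"P(target) after expansion:  {p_new:.3f}")
print(f"perplexity before: {perplexity([p_old]):.3f}")
print(f"perplexity after:  {perplexity([p_new]):.3f}")
```

How large the effect is in practice depends on how the new rows are initialized, which is why the initialization scheme matters.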


Vocabulary breakdown

Category                      Count     Examples
────────────────────────────────────────────────────────────────
Already in Mistral vocab      19        ٹ ڈ ں ھ ی ے ، ؟
Characters added (missing)    30        ڑ ۔ ۰–۹ ؛ ٪ ۓ ۍ ٖ ٗ
Subwords added (N=500)        437       frequency-ranked BPE units
Total new tokens              467
Final vocab size              32,467
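The card does not describe the exact selection procedure. One reading consistent with the counts (437 subwords added at N=500) is that the top-N frequency-ranked candidates are taken and any that already exist in the base vocabulary are skipped. The sketch below illustrates that reading under this assumption; `select_new_tokens` and the toy data are hypothetical.

```python
def select_new_tokens(ranked_candidates, vocab, n):
    """Take the top-n ranked candidates and keep those not already in vocab.

    ranked_candidates: pieces sorted from most to least frequent.
    vocab: set (or dict) of pieces already known to the base tokenizer.
    """
    seen = set()
    new_tokens = []
    for piece in ranked_candidates[:n]:
        if piece in vocab or piece in seen:
            continue  # already covered, nothing to add
        seen.add(piece)
        new_tokens.append(piece)
    return new_tokens


# Toy data (hypothetical): a tiny base vocab and a ranked candidate list.
base_vocab = {"ی", "ے", "ہے", "the"}
candidates = ["ہے", "ڑ", "کا", "ی", "ملک"]

added = select_new_tokens(candidates, base_vocab, n=4)
print(added)                          # ['ڑ', 'کا']
print(len(base_vocab) + len(added))   # 6
```

With a Hugging Face tokenizer, the `vocab` argument could be the mapping returned by `tokenizer.get_vocab()`.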

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; device_map="auto" places the weights on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    "mahwizzzz/Chisel-7B-v0.1",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mahwizzzz/Chisel-7B-v0.1")

# Verify the fertility improvement (only the tokenizer is needed for this check).
text = "پاکستان ایک خوبصورت ملک ہے"
tokens = tokenizer.encode(text, add_special_tokens=False)
print(f"{len(text.split())} words → {len(tokens)} tokens")
# expected: 5 words → 13 tokens
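The perplexity numbers in this card were measured in 4-bit, and the technical specs list NF4 with double quantization and bfloat16 compute. A minimal sketch of loading with those settings through `BitsAndBytesConfig` follows. It assumes `bitsandbytes` and `accelerate` are installed and a CUDA device is available; `load_chisel_4bit` is an illustrative helper name, and if the hosted weights are already stored quantized, the explicit config may be redundant.

```python
def load_chisel_4bit(repo_id="mahwizzzz/Chisel-7B-v0.1"):
    """Load tokenizer and model with the quantization settings from the specs.

    Imports live inside the function so that defining it does not require
    torch, transformers, or bitsandbytes to be installed.
    """
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NF4 quantization
        bnb_4bit_use_double_quant=True,         # double quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute
    )
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_chisel_4bit()` downloads the checkpoint on first use.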

Limitations

  • No Urdu semantic understanding. Responses will be in English until finetuned on Urdu data.
  • English perplexity on Wikitext-2 is 1.78 higher than the base model (10.82 vs. 9.04, both 4-bit).
  • Fertility corpus was Wikipedia Urdu (500 articles). Domain-specific subwords (medical, legal, conversational) are underrepresented.
  • Not evaluated on any Urdu downstream benchmark.

Intended use

This checkpoint is intended as a starting point for:

  • Urdu continued pretraining
  • Urdu instruction tuning
  • Urdu translation, QA, and text generation research
  • Tokenization efficiency studies for low-resource Perso-Arabic script languages

Do not use in production Urdu applications without finetuning.


Technical specs

Base model          Mistral-7B-v0.1
Architecture        Transformer decoder, 32 layers, 4096 hidden
Original vocab      32,000
Expanded vocab      32,467
Quantization        4-bit NF4, double quant, bfloat16 compute
Embedding init      Multivariate normal (Hewitt 2021)
Expansion corpus    Wikipedia Urdu, 500 articles
BPE model           SentencePiece, vocab=2000
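The "Embedding init" row refers to drawing each new embedding from a multivariate normal fitted to the existing embeddings. The toy sketch below shows that idea in pure Python. It is not the code used to build this checkpoint: the `scale=1e-5` shrink factor is an assumed default that the card does not state, and the function names are illustrative.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sample_new_embedding(rows, scale=1e-5, rng=random):
    """Draw one vector from N(mu, scale * Sigma).

    mu and Sigma are the mean and (biased) covariance of the existing rows.
    Uses x = mu + sqrt(scale / n) * sum_i z_i * (e_i - mu) with z_i ~ N(0, 1),
    which has exactly that covariance without building Sigma explicitly.
    """
    n = len(rows)
    mu = mean_vector(rows)
    x = list(mu)
    coeff = (scale / n) ** 0.5
    for row in rows:
        z = rng.gauss(0.0, 1.0)
        for j, value in enumerate(row):
            x[j] += coeff * z * (value - mu[j])
    return x


# Toy "embedding matrix": 4 existing tokens, 2 dimensions (assumed values).
existing = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
new_row = sample_new_embedding(existing, rng=random.Random(0))
print(mean_vector(existing))
print(new_row)
```

With a small scale, every new row lands close to the mean of the existing embeddings. For a real 32,000 × 4096 matrix the same computation would normally use tensor operations, for example `torch.distributions.MultivariateNormal`, rather than Python loops.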

Citation

@misc{khalil2025chisel7b,
  author    = {Mahwiz Khalil},
  title     = {Chisel-7B: Selective Vocabulary Expansion for Urdu Adaptation of Mistral-7B},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mahwizzzz/Chisel-7B-v0.1}
}
@article{jiang2023mistral,
  title   = {Mistral 7B},
  author  = {Jiang, Albert Q and others},
  journal = {arXiv preprint arXiv:2310.06825},
  year    = {2023}
}
@misc{hewitt2021initializing,
  author = {Hewitt, John},
  title  = {Initializing New Word Embeddings for Pretrained Language Models},
  year   = {2021},
  url    = {https://nlp.stanford.edu/~johnhew/vocab-expansion.html}
}
