Chisel-7B-v0.1

Chisel-7B is a vocabulary-expanded variant of Mistral-7B-v0.1 with targeted Urdu script support. The model is a pre-finetuning checkpoint: it does not yet understand Urdu semantics, but it tokenizes Urdu text with 36.3% lower fertility (tokens per word) than the base model, making it a stronger foundation for Urdu continued pretraining or instruction tuning.

Results

Fertility vs. subwords added

N       Vocab     Tokens/Word     Reduction
──────────────────────────────────────────
0       32,000       4.59            —
30      32,030       4.55          0.9%
50      32,077       3.69         19.5%
100     32,125       3.47         24.5%
200     32,214       3.22         29.9%
300     32,295       3.09         32.6%
400     32,377       3.00         34.7%
500     32,467       2.92         36.3%   ← recommended
700     32,648       2.82         38.6%
1000    32,924       2.70         41.2%
1500    33,372       2.57         44.1%
2000    33,593       2.52         45.1%   ← maximum

N=500 recovers 80.5% of the largest measured fertility reduction (36.3% out of the 45.1% reached at N=2000) with only a 1.46% vocabulary increase. The marginal gain per 100 subwords drops below 1.5 percentage points after N=500.
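These figures can be re-derived from the table. The sketch below hard-codes the reported Vocab and Reduction columns (it does not re-tokenize anything) and recomputes the share of the N=2000 reduction reached at N=500, the relative vocabulary growth, and the gain per 100 subwords for each interval. The names `ROWS` and `marginal_gain_per_100` are illustrative and not taken from the model's codebase.

```python
# (N, vocab size, reported fertility reduction in %), copied from the table above.
ROWS = [
    (0, 32000, 0.0), (30, 32030, 0.9), (50, 32077, 19.5), (100, 32125, 24.5),
    (200, 32214, 29.9), (300, 32295, 32.6), (400, 32377, 34.7), (500, 32467, 36.3),
    (700, 32648, 38.6), (1000, 32924, 41.2), (1500, 33372, 44.1), (2000, 33593, 45.1),
]


def marginal_gain_per_100(rows):
    """Percentage points of reduction gained per 100 extra subwords, per interval."""
    gains = {}
    for (n0, _, r0), (n1, _, r1) in zip(rows, rows[1:]):
        gains[(n0, n1)] = (r1 - r0) / (n1 - n0) * 100
    return gains


by_n = {n: (vocab, red) for n, vocab, red in ROWS}
base_vocab = by_n[0][0]
vocab_500, red_500 = by_n[500]
max_red = by_n[2000][1]

# Share of the largest measured reduction (N=2000) already reached at N=500.
share_of_max = red_500 / max_red * 100
# Relative growth of the vocabulary at N=500.
vocab_growth = (vocab_500 - base_vocab) / base_vocab * 100

print(f"share of N=2000 reduction at N=500: {share_of_max:.1f}%")
print(f"vocabulary growth at N=500: {vocab_growth:.2f}%")
for (n0, n1), gain in marginal_gain_per_100(ROWS).items():
    print(f"N={n0:>4} -> {n1:>4}: {gain:5.2f} points per 100 subwords")
```

Because the inputs are the rounded table values, the interval gains are approximate to about 0.1 percentage points.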

Per-sentence tokenization

Sentence	Words	Base tokens	Chisel tokens	Reduction
پاکستان ایک خوبصورت ملک ہے	5	26	13	50.0%
اردو زبان بہت میٹھی ہے	5	22	14	36.4%
کراچی پاکستان کا سب سے بڑا شہر ہے	8	35	20	42.9%
علم حاصل کرنا ہر مسلمان پر فرض ہے	8	33	18	45.5%
آج موسم بہت اچھا ہے	5	19	12	36.8%
Average	6.2	27.0	15.4	43.0%
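Fertility here is the number of tokens per whitespace-separated word. A minimal sketch of how the comparison could be reproduced follows. `fertility` and `reduction_pct` are hypothetical helper names, and the two toy encoders are offline stand-ins chosen so the example runs without downloading anything; they are not the Mistral or Chisel tokenizers.

```python
from typing import Callable, Iterable, List


def fertility(encode: Callable[[str], List[int]], sentences: Iterable[str]) -> float:
    """Total tokens divided by total whitespace-separated words."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(encode(sentence))
        total_words += len(sentence.split())
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


def reduction_pct(base: float, new: float) -> float:
    """Relative reduction of `new` versus `base`, in percent."""
    return (1 - new / base) * 100


# Offline stand-ins for real tokenizers (assumptions for illustration only):
# one token per UTF-8 byte versus one token per non-space character.
def byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def char_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text if not ch.isspace()]


sentences = ["پاکستان ایک خوبصورت ملک ہے", "آج موسم بہت اچھا ہے"]
base_f = fertility(byte_encode, sentences)
new_f = fertility(char_encode, sentences)
print(f"byte-level: {base_f:.2f} tokens/word")
print(f"char-level: {new_f:.2f} tokens/word")
print(f"reduction:  {reduction_pct(base_f, new_f):.1f}%")
```

To measure the real tokenizers, pass `lambda s: tok.encode(s, add_special_tokens=False)` as `encode`, with `tok` loaded through `AutoTokenizer.from_pretrained` for `mistralai/Mistral-7B-v0.1` and `mahwizzzz/Chisel-7B-v0.1` respectively.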

English perplexity (Wikitext-2)

Model	Perplexity
Mistral-7B-v0.1 (4-bit)	9.04
Chisel-7B-v0.1 (4-bit)	10.82
Delta	+1.78

The +1.78 perplexity increase is attributed to softmax redistribution over the expanded output vocabulary: the newly added tokens take probability mass away from the original ones. This degradation is expected to recover after continued pretraining on Urdu data, consistent with prior vocabulary expansion work (Hewitt 2021; Cui et al. 2023).
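The redistribution effect can be illustrated with a toy softmax. In the sketch below all logits are made-up values, not taken from the model. Appending logits for new tokens lowers the log-probability of every original token by the same amount, log(1 + Z_new / Z_old), where Z is the sum of exp(logit) over the respective token set. In a real model this drop varies with context, so the example only shows the mechanism.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


# Toy logits over a 5-token "original" vocabulary (made-up values).
old_logits = [2.0, 1.0, 0.5, 0.0, -1.0]
# Logits assigned to 3 newly added tokens (also made up).
new_logits = [0.8, 0.3, 0.1]

before = log_softmax(old_logits)
after = log_softmax(old_logits + new_logits)[: len(old_logits)]

# Closed-form drop in log-probability shared by all original tokens.
z_old = sum(math.exp(x) for x in old_logits)
z_new = sum(math.exp(x) for x in new_logits)
predicted_drop = math.log(1 + z_new / z_old)

for b, a in zip(before, after):
    print(f"log p before = {b:.4f}, after = {a:.4f}, drop = {b - a:.4f}")
print(f"predicted drop = {predicted_drop:.4f}")

# If every target token lost this much log-probability, perplexity would be
# multiplied by exp(drop). Conversely, the reported 9.04 -> 10.82 corresponds
# to an average per-token drop of log(10.82 / 9.04) nats.
print(f"perplexity multiplier = {math.exp(predicted_drop):.4f}")
print(f"implied average drop for reported perplexities = {math.log(10.82 / 9.04):.4f} nats")
```

Under the assumption that the English tokenization is unchanged by the expansion, the reported perplexities imply an average loss of roughly 0.18 nats per token.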

Vocabulary breakdown

Category	Count	Examples
Already in Mistral vocab	19	ٹ ڈ ں ھ ی ے ، ؟
Characters added (missing)	30	ڑ ۔ ۰–۹ ؛ ٪ ۓ ۍ ٖ ٗ
Subwords added (N=500)	437	frequency-ranked BPE units
Total new tokens	467
Final vocab size	32,467

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" places weights on the available devices (requires accelerate)
model = AutoModelForCausalLM.from_pretrained(
    "mahwizzzz/Chisel-7B-v0.1",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mahwizzzz/Chisel-7B-v0.1")

# verify fertility improvement
text = "پاکستان ایک خوبصورت ملک ہے"
tokens = tokenizer.encode(text, add_special_tokens=False)
print(f"{len(text.split())} words → {len(tokens)} tokens")
# expected: 5 words → 13 tokens
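
The perplexity results and the technical specs refer to 4-bit NF4 quantization, while the snippet above loads the checkpoint with default settings. A minimal sketch of a matching 4-bit load follows, assuming `bitsandbytes` and `accelerate` are installed and a CUDA GPU is available. `load_chisel_4bit` and `NF4_SETTINGS` are hypothetical names, and whether the reported numbers were produced with exactly this configuration is an assumption.

```python
# Quantization settings mirroring the "Technical specs" entry
# (4-bit NF4, double quant, bfloat16 compute). Assumed, not confirmed by the author.
NF4_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}


def load_chisel_4bit(repo_id: str = "mahwizzzz/Chisel-7B-v0.1"):
    """Hypothetical helper: load model and tokenizer with 4-bit NF4 weights."""
    # Heavy imports are deferred so the settings above can be inspected
    # on a machine without torch or transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,
        **NF4_SETTINGS,
    )
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    return model, tokenizer
```

Calling `model, tokenizer = load_chisel_4bit()` downloads the checkpoint on first use. Quantization does not affect the tokenizer, so the fertility check above behaves the same either way.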

Limitations

No Urdu semantic understanding. Responses will be in English until the model is finetuned on Urdu data.
English perplexity on Wikitext-2 is 1.78 higher than the base model's (4-bit evaluation).
The fertility corpus was Urdu Wikipedia (500 articles), so domain-specific subwords (medical, legal, conversational) are underrepresented.
Not evaluated on any Urdu downstream benchmark.

Intended use

This checkpoint is intended as a starting point for:

Urdu continued pretraining
Urdu instruction tuning
Urdu translation, QA, and text generation research
Tokenization efficiency studies for low-resource Perso-Arabic script languages

Do not use in production Urdu applications without finetuning.

Technical specs


Property	Value
Base model	Mistral-7B-v0.1
Architecture	Transformer decoder, 32 layers, 4096 hidden
Original vocab	32,000
Expanded vocab	32,467
Quantization	4-bit NF4, double quant, bfloat16 compute
Embedding init	Multivariate normal (Hewitt 2021)
Expansion corpus	Wikipedia Urdu, 500 articles
BPE model	SentencePiece, vocab=2000

Citation

@misc{khalil2025chisel7b,
  author    = {Mahwiz Khalil},
  title     = {Chisel-7B: Selective Vocabulary Expansion for Urdu Adaptation of Mistral-7B},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mahwizzzz/Chisel-7B-v0.1}
}

@article{jiang2023mistral,
  title   = {Mistral 7B},
  author  = {Jiang, Albert Q and others},
  journal = {arXiv preprint arXiv:2310.06825},
  year    = {2023}
}

@misc{hewitt2021initializing,
  author = {Hewitt, John},
  title  = {Initializing New Word Embeddings for Pretrained Language Models},
  year   = {2021},
  url    = {https://nlp.stanford.edu/~johnhew/vocab-expansion.html}
}
