Instructions to use LiteMind/YutaLM-M2-bnb-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use LiteMind/YutaLM-M2-bnb-4bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="LiteMind/YutaLM-M2-bnb-4bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LiteMind/YutaLM-M2-bnb-4bit")
model = AutoModelForCausalLM.from_pretrained("LiteMind/YutaLM-M2-bnb-4bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use LiteMind/YutaLM-M2-bnb-4bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LiteMind/YutaLM-M2-bnb-4bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LiteMind/YutaLM-M2-bnb-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
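
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is reachable at `http://localhost:8000`; the helper names `build_chat_request` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumption: the server from the commands above is listening locally on port 8000.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "LiteMind/YutaLM-M2-bnb-4bit"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` should return the model's reply as a string. Passing a different `base_url` points the same helper at any other OpenAI-compatible endpoint.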

Use Docker

docker model run hf.co/LiteMind/YutaLM-M2-bnb-4bit

SGLang

How to use LiteMind/YutaLM-M2-bnb-4bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "LiteMind/YutaLM-M2-bnb-4bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LiteMind/YutaLM-M2-bnb-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "LiteMind/YutaLM-M2-bnb-4bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LiteMind/YutaLM-M2-bnb-4bit",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use LiteMind/YutaLM-M2-bnb-4bit with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for LiteMind/YutaLM-M2-bnb-4bit to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for LiteMind/YutaLM-M2-bnb-4bit to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for LiteMind/YutaLM-M2-bnb-4bit to start chatting

Load model with FastModel

# Install Unsloth first: pip install unsloth
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="LiteMind/YutaLM-M2-bnb-4bit",
    max_seq_length=2048,
)

Docker Model Runner
How to use LiteMind/YutaLM-M2-bnb-4bit with Docker Model Runner:
```
docker model run hf.co/LiteMind/YutaLM-M2-bnb-4bit
```

YutaLM-M2-bnb-4bit

YutaLM-M2-bnb-4bit is a lightweight 350M-parameter hybrid-architecture model fine-tuned specifically for Arabic chat and roleplay.

Developed under the LiteMind initiative, this model addresses a common weakness of small language models: handling complex Arabic morphology and dialogue. By tuning selected target modules and directly training the embedding and output layers (embed_tokens and lm_head) on Arabic text, YutaLM-M2 aims to deliver fluid, expressive, and culturally nuanced Arabic interactions while keeping a small hardware footprint.

🚀 Key Features

| Feature | Description |
| --- | --- |
| Tailored for Arabic Roleplay & Chat | Fine-tuned on conversational datasets to provide engaging, expressive, and grammatically sound Arabic outputs. |
| Hybrid Architecture Optimization | Safely trained using advanced gradient handling to perfectly accommodate the model's specialized convolution (conv) and recurrence layer mechanics. |
| Enhanced Arabic Tokenization | Unlike naive fine-tunes, this model had its vocabulary embeddings and language model head explicitly trained to prevent letter-mixing and broken script generation. |
| Ultra-Low Resource Footprint | Pre-quantized and merged natively in 4-bit using bitsandbytes (`forced_merged_4bit`), allowing seamless deployment and rapid inference on consumer-grade GPUs or free cloud tiers (like Google Colab T4). |
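
As a rough illustration of why 4-bit storage matters, the arithmetic below estimates weight memory for the stated 350M parameters. It is a back-of-envelope sketch, not a measurement: it assumes every parameter is stored at the given width, whereas bitsandbytes typically keeps embeddings and normalization layers in higher precision and adds quantization constants, and it ignores activations and cache memory entirely.

```python
# Back-of-envelope weight-memory estimate. Assumptions (not from the model card):
# every parameter is stored at the stated width, and all overheads are ignored.
PARAMS = 350_000_000  # the 350M parameter count stated above


def weight_megabytes(num_params: int, bits_per_param: int) -> float:
    """Return the weight storage size in decimal megabytes."""
    return num_params * bits_per_param / 8 / 1e6


fp16_mb = weight_megabytes(PARAMS, 16)
int4_mb = weight_megabytes(PARAMS, 4)
print(f"16-bit weights: {fp16_mb:.0f} MB")
print(f"4-bit weights:  {int4_mb:.0f} MB")
```

Under these assumptions the 4-bit weights need about a quarter of the 16-bit storage; the real checkpoint will be somewhat larger for the reasons listed.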

📝 Prompt Template & Format

The model uses the ChatML (LFM) sequence format. For the best results in chat or roleplay, structure your inputs as follows:

<|im_start|>user
{Your Prompt Here}
<|im_end|>
<|im_start|>assistant
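
For cases where a raw prompt string is needed, the layout above can be reproduced with a few lines of Python. This is a minimal sketch that mirrors the template exactly as printed here; `build_prompt` is a hypothetical helper, and the tokenizer's bundled chat template (applied by `tokenizer.apply_chat_template`) remains the authoritative source for exact whitespace and any leading special tokens.

```python
def build_prompt(messages: list[dict[str, str]]) -> str:
    """Render messages in the layout shown above and open an assistant turn."""
    parts = []
    for message in messages:
        # One block per turn: role header, content, then the end-of-turn marker.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n"
        )
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_prompt([{"role": "user", "content": "مرحباً"}])
print(prompt)
```

For real inference, prefer `apply_chat_template`, which guarantees the prompt matches what the model saw during fine-tuning.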

⚠️ Note: Smaller models are prone to repetitive or broken output. To reduce this, use a slight repetition penalty (1.05) and a moderate temperature during inference (the quick start below uses 0.73).
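
To make the effect of these two settings concrete, here is a small sketch on toy logits. It follows the common CTRL-style rule (divide positive scores by the penalty, multiply negative ones), which is also how Hugging Face transformers applies `repetition_penalty`. The three-token vocabulary and the function names are illustrative only.

```python
import math


def apply_repetition_penalty(logits, seen_token_ids, penalty=1.05):
    """Make already-generated tokens less likely (CTRL-style penalty)."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Positive scores shrink; negative scores become more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def softmax_with_temperature(logits, temperature=0.73):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.1, -1.0, 0.5]  # toy scores for a 3-token vocabulary
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 1])
probs = softmax_with_temperature(penalized, temperature=0.73)
print(penalized)
print(probs)
```

Tokens 0 and 1 count as already generated, so both lose score while token 2 is untouched. A penalty of 1.05 is a gentle nudge; larger values suppress repetition more strongly but can also distort legitimate repeated words.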

💻 Quick Start & Inference

You can easily run this model using the Hugging Face transformers library. Ensure you have bitsandbytes and accelerate installed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = "LiteMind/YutaLM-M2-bnb-4bit"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare the conversation
messages = [
    {"role": "user", "content": "مرحباً يا يوتا، كيف يمكنني برمجة نموذج ذكاء اصطناعي صغير؟"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to(model.device)  # follow the device chosen by device_map="auto"

# Initialize streamer for real-time output
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate response
print("Assistant: ")
_ = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,  # needed for temperature, top_k, and top_p to take effect
    temperature=0.73,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.05,
    streamer=streamer
)

⚙️ Development & Optimization

The model was fine-tuned with the Unsloth framework on a single GPU.

To preserve the model's linguistic capabilities during adaptation, training used a masked approach: the loss was computed exclusively on the assistant's Arabic responses. In addition, the vocabulary layers were updated during training to absorb Arabic style and semantics, which helps avoid the degradation often seen in small quantized models.
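
The masking idea can be sketched in a few lines. The exact training pipeline is not given here, so the code below is only an illustration under stated assumptions: labels are a copy of the input token IDs, and every position outside an assistant turn is replaced with -100, the value that PyTorch's cross-entropy loss ignores by default. The token IDs and the `mask_non_assistant_labels` helper are made up for the example.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_non_assistant_labels(input_ids, assistant_mask):
    """Copy input_ids into labels, hiding every position outside assistant turns."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [
        token_id if is_assistant else IGNORE_INDEX
        for token_id, is_assistant in zip(input_ids, assistant_mask)
    ]


# Toy sequence: four prompt tokens followed by three assistant tokens.
input_ids = [11, 12, 13, 14, 21, 22, 23]
assistant_mask = [0, 0, 0, 0, 1, 1, 1]
labels = mask_non_assistant_labels(input_ids, assistant_mask)
print(labels)
```

Only the last three positions contribute to the loss, so gradients come from the assistant's reply rather than from reproducing the user's prompt.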

🛑 Limitations & Biases

Due to its compact 350M-parameter size, YutaLM-M2 is best treated as a specialized creative and conversational assistant rather than a factual reference.
It may hallucinate when given complex logical or mathematical problems.
It is heavily optimized for Arabic; performance on language switching or English coding tasks may lag behind generalist base models.

📜 License

This model is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). You are free to use, modify, and host this model, provided that any derivative works, modified versions, or web services leveraging this model are also open-sourced under the same AGPL-3.0 terms.

🤝 Acknowledgements

Special thanks to the Unsloth team for providing the memory-efficient frameworks that make fine-tuning hybrid, low-parameter models accessible and highly performant.

Downloads last month: 48

Safetensors
Model size: 0.9B params
Tensor type: F32, F16

Model tree for LiteMind/YutaLM-M2-bnb-4bit
Base model: LiquidAI/LFM2-350M
Quantized (35): this model