YutaLM-M2-bnb-4bit
YutaLM-M2-bnb-4bit is a lightweight 350M-parameter hybrid-architecture model fine-tuned for Arabic chat and roleplay.
Developed under the LiteMind initiative, this model targets a common weakness of small language models: handling complex Arabic morphology and dialogue. By combining fine-tuning of selected target modules with direct training of the embedding and output layers (embed_tokens and lm_head), YutaLM-M2 aims to deliver fluid, contextually expressive, and culturally nuanced Arabic interactions while keeping a small hardware footprint.
Key Features
| Feature | Description |
|---|---|
| Tailored for Arabic Roleplay & Chat | Fine-tuned on conversational datasets to produce engaging, expressive, and grammatically sound Arabic output. |
| Hybrid Architecture Optimization | Trained with gradient handling adapted to the model's convolution (conv) and recurrence layers. |
| Enhanced Arabic Tokenization | Unlike naive fine-tunes, this model had its vocabulary embeddings and language model head explicitly trained to prevent letter mixing and broken script generation. |
| Ultra-Low Resource Footprint | Pre-quantized and merged in 4-bit with bitsandbytes (forced_merged_4bit), allowing deployment and fast inference on consumer-grade GPUs or free cloud tiers such as a Google Colab T4. |
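As a rough illustration of the footprint row above, the sketch below estimates weight storage for a 350M-parameter model at several precisions. It is back-of-the-envelope arithmetic only. It assumes every parameter is stored at the stated bit width and ignores quantization metadata, any layers kept in higher precision, activations, and runtime caches, so real memory use will be higher. The function name is illustrative and not part of any library.

```python
def weight_memory_mib(num_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in MiB, assuming one uniform bit width."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 2)


NUM_PARAMS = 350_000_000  # nominal parameter count stated in this model card

# Compare full precision, half precision, and 4-bit storage.
for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_mib(NUM_PARAMS, bits):.0f} MiB")
```

Under these assumptions, 4-bit storage needs a quarter of the fp16 weight memory, which is why the model fits comfortably on small GPUs.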
Prompt Template & Format
The model uses the standard ChatML (LFM) sequence format. For the best chat or roleplay results, structure your inputs as follows:
```
<|im_start|>user
{Your Prompt Here}
<|im_end|>
<|im_start|>assistant
```
Note: To reduce the repetitive or broken output common in smaller models, apply a slight repetition penalty (1.05) and a moderate temperature during inference.
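The chat format in this section can also be assembled by hand, which helps when a serving stack expects a raw prompt string. The sketch below is a minimal illustration. It assumes the conventional ChatML layout in which `<|im_end|>` follows the message text directly and each turn ends with a newline. The helper name `build_chatml_prompt` is hypothetical. The tokenizer's own chat template (`tokenizer.apply_chat_template`) remains the authoritative source and may add special tokens that this sketch omits.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(
    messages: List[Dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Render messages as ChatML text (assumed layout, see the lead-in)."""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

For actual inference, prefer the tokenizer's template so the prompt matches what the model saw during fine-tuning.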
Quick Start & Inference
Run the model with the Hugging Face transformers library. Make sure bitsandbytes and accelerate are installed.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = "LiteMind/YutaLM-M2-bnb-4bit"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Prepare the conversation
# English: "Hello Yuta, how can I program a small AI model?"
messages = [
    {"role": "user", "content": "مرحباً يا يوتا، كيف يمكنني برمجة نموذج ذكاء اصطناعي صغير؟"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)  # follow device_map instead of hard-coding "cuda"

# Initialize streamer for real-time output
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate response
print("Assistant: ")
_ = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,  # enable sampling so temperature/top_k/top_p are applied
    temperature=0.73,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.05,
    streamer=streamer,
)
```
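To make the `repetition_penalty=1.05` setting above more concrete, here is a simplified, dependency-free sketch of the rule commonly used for this parameter: the score of every token already present in the context is divided by the penalty when positive and multiplied by it when negative, so the token becomes slightly less likely either way. The function name and the toy scores are illustrative. In practice the library applies the equivalent step to tensors inside `generate`.

```python
from typing import List, Sequence


def apply_repetition_penalty(
    logits: Sequence[float], seen_token_ids: Sequence[int], penalty: float = 1.05
) -> List[float]:
    """Return a copy of logits with already-seen tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):  # each seen token is penalized once
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [2.1, -1.0, 0.5, 3.0]  # toy scores for a 4-token vocabulary
seen = [0, 1, 0]                # token ids already present in the context
print(apply_repetition_penalty(logits, seen))
```

A value close to 1.0, such as 1.05, nudges the model away from loops without blocking words that legitimately recur in a dialogue.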
Development & Optimization
The model was fine-tuned with the Unsloth framework on a single-GPU instance.
To preserve the model's linguistic capabilities during adaptation, training used loss masking, so the loss was computed only on the assistant's Arabic responses. The vocabulary layers were also updated during training to absorb Arabic stylistic patterns, which is intended to limit the degradation often seen in small quantized models.
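The general idea of computing loss only on assistant responses can be sketched as follows. This is a toy illustration, not the training code used for this model. The token ids and the helper name are made up, and the sketch relies only on the widely used convention that labels set to -100 are skipped by PyTorch's cross-entropy loss by default.

```python
from typing import List

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def mask_non_assistant_labels(
    input_ids: List[int], assistant_mask: List[bool]
) -> List[int]:
    """Copy input_ids into labels, hiding every token outside assistant turns."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [
        token if is_assistant else IGNORE_INDEX
        for token, is_assistant in zip(input_ids, assistant_mask)
    ]


# Toy example: 4 prompt tokens followed by 3 assistant tokens (ids are made up).
input_ids = [11, 12, 13, 14, 21, 22, 23]
assistant_mask = [False, False, False, False, True, True, True]
labels = mask_non_assistant_labels(input_ids, assistant_mask)
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```

With labels built this way, the user prompt still conditions the model as context, but only the assistant tokens contribute to the gradient.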
Limitations & Biases
- Because of its compact 350M-parameter size, YutaLM-M2 should be treated as a specialized creative and conversational assistant rather than a factual reference.
- It may hallucinate when given complex logical or mathematical problems.
- The model is optimized for Arabic; performance on multilingual code-switching or English coding tasks may be weaker than that of generalist base models.
License
This model is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). You are free to use, modify, and host this model, provided that any derivative works, modified versions, or web services leveraging this model are also open-sourced under the same AGPL-3.0 terms.
Acknowledgements
Special thanks to the Unsloth team for providing the memory-efficient frameworks that make fine-tuning hybrid, low-parameter models accessible and highly performant.
Base model: LiquidAI/LFM2-350M