Instructions to use openbmb/MiniCPM5-1B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use openbmb/MiniCPM5-1B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="openbmb/MiniCPM5-1B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM5-1B")
model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM5-1B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use openbmb/MiniCPM5-1B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "openbmb/MiniCPM5-1B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openbmb/MiniCPM5-1B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
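
The same request can be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and the host and port are assumed to match the `vllm serve` default used in the curl call above.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="openbmb/MiniCPM5-1B"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` should return the reply text. Passing `base_url="http://localhost:30000"` would point the same helper at a server listening on port 30000.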

Use Docker

docker model run hf.co/openbmb/MiniCPM5-1B

SGLang

How to use openbmb/MiniCPM5-1B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "openbmb/MiniCPM5-1B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openbmb/MiniCPM5-1B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "openbmb/MiniCPM5-1B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openbmb/MiniCPM5-1B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use openbmb/MiniCPM5-1B with Docker Model Runner:
```
docker model run hf.co/openbmb/MiniCPM5-1B
```

Question about fine-tuning / inference speed

#13

by piterpetro - opened 2 days ago

Discussion

piterpetro

2 days ago

Hello everyone,

I am currently testing this model and would like to ask a couple of questions:

What are the recommended hyperparameters for fine-tuning this architecture?
Has anyone measured the average inference latency on a single T4 or V100 GPU?

Thank you for your help and for sharing this model!

piterpetro

2 days ago

I am experiencing the exact same issue with this model.
Could you please share if you found any workaround or solution for this?
Thanks!

piterpetro

2 days ago

Test

suhmily

OpenBMB org 1 day ago

edited 1 day ago

Here's the setup we used for a long-CoT SFT of the ~1B MiniCPM5 model. These settings are framework-agnostic:

Batching

micro-batch size: 1
global batch size: 128

Sequence / positions

sequence length: 65536 (64K)
max position embeddings: 131072 (128K)
RoPE base (theta): 5,000,000

LR & schedule

peak LR: 5.22e-5, min LR: 5.22e-6
schedule: WSD (warmup–stable–decay)
warmup: 250 iters; decay: exponential over the last 2750 of 3000 total iters
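
The schedule above can be sketched as a plain function of the iteration index. The post does not give the warmup shape or the exact decay formula, so this assumes a linear warmup from zero and a geometric interpolation from peak to min LR; `wsd_lr` is a hypothetical name. Note that with these numbers 250 + 2750 = 3000, so the stable phase has zero length and decay starts as soon as warmup ends.

```python
PEAK_LR = 5.22e-5
MIN_LR = 5.22e-6
WARMUP_ITERS = 250
DECAY_ITERS = 2750
TOTAL_ITERS = 3000


def wsd_lr(it):
    """Learning rate at 0-based iteration `it` (sketch, see assumptions)."""
    decay_start = TOTAL_ITERS - DECAY_ITERS
    if it < WARMUP_ITERS:
        # Assumption: linear warmup from 0 up to the peak LR.
        return PEAK_LR * (it + 1) / WARMUP_ITERS
    if it < decay_start:
        # Stable phase at peak LR (empty with the numbers above).
        return PEAK_LR
    # Assumption: exponential (geometric) interpolation from peak to min.
    progress = min(1.0, (it - decay_start) / DECAY_ITERS)
    return PEAK_LR * (MIN_LR / PEAK_LR) ** progress
```

If your framework ships its own WSD scheduler, prefer that and only use this to sanity-check the curve it produces.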

Parallelism (cluster-dependent)

tensor parallel: 1
context parallel: 4 (needed to fit the 64K sequence in memory; you can drop it at shorter sequence length)
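
To map these batching and parallelism settings onto a given cluster, the arithmetic can be sketched as follows. The GPU count is a hypothetical input (the post does not state one), `batch_plan` is a hypothetical helper, and the sketch assumes no pipeline parallelism since none is listed.

```python
MICRO_BATCH = 1
GLOBAL_BATCH = 128
SEQ_LEN = 65536
TENSOR_PARALLEL = 1
CONTEXT_PARALLEL = 4


def batch_plan(num_gpus):
    """Derive data-parallel size and gradient accumulation for `num_gpus`."""
    model_parallel = TENSOR_PARALLEL * CONTEXT_PARALLEL
    if num_gpus % model_parallel:
        raise ValueError("num_gpus must be a multiple of TP * CP")
    data_parallel = num_gpus // model_parallel
    sequences_per_pass = MICRO_BATCH * data_parallel
    if GLOBAL_BATCH % sequences_per_pass:
        raise ValueError("global batch must divide evenly across replicas")
    return {
        "data_parallel": data_parallel,
        "grad_accum_steps": GLOBAL_BATCH // sequences_per_pass,
        # Context parallelism splits each sequence across CP ranks.
        "tokens_per_gpu_per_sequence": SEQ_LEN // CONTEXT_PARALLEL,
        # Upper bound: assumes every sequence is packed to full length.
        "max_tokens_per_step": GLOBAL_BATCH * SEQ_LEN,
    }


print(batch_plan(32))
# {'data_parallel': 8, 'grad_accum_steps': 16,
#  'tokens_per_gpu_per_sequence': 16384, 'max_tokens_per_step': 8388608}
```

The 32-GPU figure is only an illustration; the per-step token budget (128 × 65536, about 8.4M tokens) is independent of cluster size.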

Mode

continued fine-tune from the released checkpoint (weights only; data/sampler state reset)

suhmily

OpenBMB org 1 day ago

For Q2 (inference latency on a single T4 / V100, served with vLLM or SGLang): I don't have measured numbers, but the architecture is plain Llama (1.08B total / ~0.68B non-embedding params, GQA 16:2, bf16), so here's a first-principles estimate. vLLM and SGLang should behave essentially the same here, since both use a paged KV cache with continuous batching; the limits below come from the hardware, not the framework.

Single request, batch = 1 (latency-oriented)

Decode is memory-bandwidth-bound: the engine reads ~1.8 GB of weights per token (transformer blocks + the untied lm_head). At a realistic ~70% bandwidth utilization:

V100 (~900 GB/s HBM2): ~3 ms/token → roughly 300–350 tok/s
T4 (~320 GB/s GDDR6): ~8 ms/token → roughly 110–130 tok/s
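
The decode figures follow from dividing the bytes read per token by the effective memory bandwidth. A small sketch of that arithmetic, using only the numbers quoted above (`decode_estimate` is a hypothetical helper, and the 70% utilization is the post's assumption, not a measurement):

```python
def decode_estimate(bandwidth_gb_s, weights_gb=1.8, utilization=0.70):
    """Return (ms per token, tokens per second) for bandwidth-bound decode."""
    effective_gb_s = bandwidth_gb_s * utilization
    ms_per_token = weights_gb / effective_gb_s * 1000
    return ms_per_token, 1000 / ms_per_token


for name, bandwidth in (("V100", 900), ("T4", 320)):
    ms, tok_s = decode_estimate(bandwidth)
    print(f"{name}: {ms:.1f} ms/token, {tok_s:.0f} tok/s")
# V100: 2.9 ms/token, 350 tok/s
# T4: 8.0 ms/token, 124 tok/s
```

These are ceilings for batch size 1 under the stated utilization; kernel launch overhead and sampling would push real numbers lower.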

Prefill / TTFT (~512-token prompt, ~0.7 TFLOP) is compute-bound, at ~50% MFU:

V100 (~125 TFLOPS FP16): ~12–20 ms
T4 (~65 TFLOPS FP16): ~22–40 ms
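
The prefill cost can be approximated with the common rule of about 2 FLOPs per non-embedding parameter per token. This sketch ignores the attention-score FLOPs and the lm_head, so it lands at or slightly below the low end of the ranges above; `prefill_ms` is a hypothetical helper and the 50% MFU is the post's assumption.

```python
def prefill_ms(prompt_tokens, peak_tflops,
               non_embedding_params=0.68e9, mfu=0.50):
    """Estimate compute-bound prefill time in milliseconds."""
    # ~2 FLOPs (one multiply, one add) per parameter per token.
    flops = 2 * non_embedding_params * prompt_tokens
    achieved_flops_per_s = peak_tflops * 1e12 * mfu
    return flops / achieved_flops_per_s * 1000


for name, tflops in (("V100", 125), ("T4", 65)):
    print(f"{name}: {prefill_ms(512, tflops):.1f} ms")
# V100: 11.1 ms
# T4: 21.4 ms
```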

Many requests, continuous batching (throughput-oriented)

This is where both engines win. The KV cache is tiny thanks to GQA: ~24 KB/token (K and V × 2 kv-heads × 128 head dim × 24 layers × 2 B), so the weights (~2.16 GB) leave ~13 GB on a 16 GB card. That is room for hundreds of thousands of cached tokens, i.e. large batches. Aggregate decode throughput scales with batch size until you become compute-bound:

V100: expect ~2,000–4,000 tok/s aggregate at batch ~32–64
T4: expect ~600–1,200 tok/s aggregate at similar batch
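
The KV-cache budget can be checked by multiplying out the factors quoted above. This sketch assumes 128 is the head dimension and that each layer stores one K and one V vector per KV head; under those assumptions the product is 24,576 bytes per token. The function names are hypothetical.

```python
KV_HEADS = 2
HEAD_DIM = 128        # assumption: the "128" in the post is the head dimension
NUM_LAYERS = 24
BYTES_PER_VALUE = 2   # bf16 / fp16


def kv_bytes_per_token():
    """Bytes of KV cache one token occupies across all layers."""
    # Leading 2 accounts for storing both K and V.
    return 2 * KV_HEADS * HEAD_DIM * NUM_LAYERS * BYTES_PER_VALUE


def cacheable_tokens(free_gb=13.0):
    """How many tokens fit in `free_gb` (decimal GB) of spare memory."""
    return int(free_gb * 1e9 // kv_bytes_per_token())


print(kv_bytes_per_token())   # 24576
print(cacheable_tokens())     # 528971
```

Real engines reserve memory in fixed-size blocks and keep some headroom for activations, so the usable token count will be somewhat lower than this upper bound.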

suhmily changed discussion status to closed 1 day ago
