Question about fine-tuning / inference speed
Hello everyone,
I am currently testing this model and would like to ask a couple of questions:
- What are the recommended hyperparameters for fine-tuning this architecture?
- Has anyone measured the average inference latency on a single T4 or V100 GPU?
Thank you for your help and for sharing this model!
I am experiencing the exact same issue with this model.
Could you please share if you found any workaround or solution for this?
Thanks!
Here's the setup we used for a long-CoT SFT of the ~1B MiniCPM5 model. The settings below are framework-agnostic:
Batching
- micro-batch size: 1
- global batch size: 128
Sequence / positions
- sequence length: 65536 (64K)
- max position embeddings: 131072 (128K)
- RoPE base (theta): 5,000,000
LR & schedule
- peak LR: 5.22e-5, min LR: 5.22e-6
- schedule: WSD (warmup–stable–decay)
- warmup: 250 iters; decay: exponential over the last 2750 of 3000 total iters
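The schedule above can be sketched in a few lines. This is a minimal illustration, not the training framework's implementation: the post does not say how the warmup ramps or which exponential form the decay uses, so linear warmup and geometric interpolation from peak to min LR are assumptions, as is the function name `wsd_lr`. With 250 warmup iterations and a 2750-iteration decay out of 3000 total, the stable phase works out to zero iterations.

```python
# Values taken from the post.
PEAK_LR = 5.22e-5
MIN_LR = 5.22e-6
WARMUP_ITERS = 250
DECAY_ITERS = 2750
TOTAL_ITERS = 3000

# Iteration at which the decay phase begins, and the length of the stable phase.
DECAY_START = TOTAL_ITERS - DECAY_ITERS
STABLE_ITERS = DECAY_START - WARMUP_ITERS


def wsd_lr(step: int) -> float:
    """Learning rate at a 0-based iteration index under a WSD schedule."""
    if step < WARMUP_ITERS:
        # Assumption: linear warmup that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_ITERS
    if step < DECAY_START:
        # Stable phase (empty with the numbers above).
        return PEAK_LR
    # Assumption: geometric interpolation from PEAK_LR down to MIN_LR.
    progress = min((step - DECAY_START) / DECAY_ITERS, 1.0)
    return PEAK_LR * (MIN_LR / PEAK_LR) ** progress


for step in (0, 249, 250, 1625, 3000):
    print(f"iter {step:>4}: lr = {wsd_lr(step):.3e}")
```

Swapping in a different decay shape only changes the last line of the function.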
Parallelism (cluster-dependent)
- tensor parallel: 1
- context parallel: 4 (needed to fit the 64K sequence in memory; you can drop it at shorter sequence length)
Mode
- continued fine-tune from the released checkpoint (weights only; data/sampler state reset)
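To adapt these settings to a different cluster, the main quantity to recompute is the number of gradient-accumulation steps. The sketch below is one way to do that arithmetic. It assumes no pipeline parallelism (the post does not mention any), and the GPU counts in the loop are hypothetical examples rather than the cluster that was used. The token totals are upper bounds that hold only if every sequence is packed to the full 64K length.

```python
# Values taken from the post.
MICRO_BATCH = 1
GLOBAL_BATCH = 128
SEQ_LEN = 65536
TENSOR_PARALLEL = 1
CONTEXT_PARALLEL = 4
TOTAL_ITERS = 3000


def data_parallel_size(num_gpus: int) -> int:
    """Number of model replicas, assuming pipeline parallel size 1."""
    gpus_per_replica = TENSOR_PARALLEL * CONTEXT_PARALLEL
    if num_gpus % gpus_per_replica != 0:
        raise ValueError("num_gpus must be a multiple of TP * CP")
    return num_gpus // gpus_per_replica


def grad_accum_steps(num_gpus: int) -> int:
    """Micro-steps per optimizer step needed to reach the global batch size."""
    samples_per_micro_step = MICRO_BATCH * data_parallel_size(num_gpus)
    if GLOBAL_BATCH % samples_per_micro_step != 0:
        raise ValueError("global batch must be divisible by micro-batch * DP")
    return GLOBAL_BATCH // samples_per_micro_step


# Upper bounds on tokens processed, reached only with fully packed sequences.
TOKENS_PER_ITER = GLOBAL_BATCH * SEQ_LEN
MAX_TOTAL_TOKENS = TOKENS_PER_ITER * TOTAL_ITERS

for gpus in (8, 32, 128):  # hypothetical cluster sizes
    print(f"{gpus} GPUs: DP={data_parallel_size(gpus)}, "
          f"grad accumulation={grad_accum_steps(gpus)}")
print(f"tokens per iteration (max): {TOKENS_PER_ITER:,}")
print(f"tokens over the run (max): {MAX_TOTAL_TOKENS:,}")
```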
For Q2 (inference latency on a single T4 / V100, served with vLLM or SGLang): I don't have measured numbers, but the architecture is plain Llama (1.08B total / ~0.68B non-embedding params, GQA 16:2, bf16), so here's a first-principles estimate. Neither T4 nor V100 supports bf16 natively, so in practice you would serve in fp16, which is the same 2 bytes per parameter. vLLM and SGLang should behave essentially the same here since both use a paged KV cache and continuous batching; the limits below come from the hardware, not the framework.
Single request, batch = 1 (latency-oriented)
Decode is memory-bandwidth-bound: the engine reads ~1.8 GB of weights per token (transformer blocks + the untied lm_head). At a realistic ~70% bandwidth utilization:
- V100 (~900 GB/s HBM2): ~3 ms/token → roughly 300–350 tok/s
- T4 (~320 GB/s GDDR6): ~8 ms/token → roughly 110–130 tok/s
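The decode numbers above come from dividing the bytes read per token by the effective memory bandwidth. Here is a small sketch of that arithmetic, taking the ~1.8 GB per token, the peak bandwidths, and the 70% utilization straight from the estimate. The function name `decode_estimate` is only for illustration, and the model ignores kernel-launch and sampling overhead, so treat the result as a ceiling rather than a prediction.

```python
# ~1.8 GB of weights read per generated token (figure from the estimate above).
WEIGHT_BYTES_PER_TOKEN = 1.8e9


def decode_estimate(peak_bandwidth_gb_s: float, utilization: float = 0.7):
    """Return (ms per token, tokens per second) for batch-1 decode.

    Assumes decode time is purely the time needed to stream the weights once.
    """
    effective_bytes_per_s = peak_bandwidth_gb_s * 1e9 * utilization
    seconds_per_token = WEIGHT_BYTES_PER_TOKEN / effective_bytes_per_s
    return seconds_per_token * 1e3, 1.0 / seconds_per_token


for name, bandwidth in (("V100", 900.0), ("T4", 320.0)):
    ms, tps = decode_estimate(bandwidth)
    print(f"{name}: {ms:.2f} ms/token, {tps:.0f} tok/s")
```

Lowering `utilization` toward 0.5 shows how quickly the figures fall off if the kernels do not reach the assumed bandwidth.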
Prefill / TTFT (~512-token prompt, ~0.7 TFLOP) is compute-bound, at ~50% MFU:
- V100 (~125 TFLOPS FP16): ~12–20 ms
- T4 (~65 TFLOPS FP16): ~22–40 ms
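The prefill figures can be reproduced the same way. The sketch below assumes the ~0.7 TFLOP comes from the common approximation of 2 FLOPs per non-embedding parameter per prompt token, which ignores attention-score FLOPs and the lm_head; that derivation is my assumption, not something the estimate states. The peak throughputs and the 50% MFU are taken from the text, and `prefill_ms` is an illustrative name.

```python
NON_EMBEDDING_PARAMS = 0.68e9  # from the architecture summary above
PROMPT_TOKENS = 512


def prefill_flops(params: float, tokens: int) -> float:
    """Approximate forward-pass FLOPs: 2 per parameter per token (assumption)."""
    return 2.0 * params * tokens


def prefill_ms(peak_tflops: float, mfu: float = 0.5) -> float:
    """Compute-bound time to process the prompt, in milliseconds."""
    achieved_flops_per_s = peak_tflops * 1e12 * mfu
    flops = prefill_flops(NON_EMBEDDING_PARAMS, PROMPT_TOKENS)
    return flops / achieved_flops_per_s * 1e3


print(f"prompt FLOPs: {prefill_flops(NON_EMBEDDING_PARAMS, PROMPT_TOKENS):.3e}")
for name, tflops in (("V100", 125.0), ("T4", 65.0)):
    print(f"{name}: {prefill_ms(tflops):.1f} ms")
```

The results land at the optimistic end of the ranges quoted above, which is expected since the sketch leaves out attention and scheduling overhead.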
Many requests, continuous batching (throughput-oriented)
This is where both engines win. KV cache is tiny thanks to GQA: 48 KB/token as a conservative figure (the listed factors, 2 kv-heads × 128 × 24 layers × 2 B, doubled for K and V, come to 24 KB). Weights (2.16 GB) therefore leave ~13 GB on a 16 GB card, which is room for hundreds of thousands of cached tokens, i.e. large batches. Aggregate decode throughput scales until you become compute-bound:
- V100: expect ~2,000–4,000 tok/s aggregate at batch ~32–64
- T4: expect ~600–1,200 tok/s aggregate at similar batch
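The KV-cache capacity claim can be checked with a few lines. The sketch computes the per-token footprint from the quoted head count, head dimension, layer count, and element size, counting both keys and values, and also evaluates the 48 KB/token figure. The ~13 GB of free memory is taken from the text; the 4096-token context per request is a hypothetical value chosen only to turn cached tokens into a concurrent-request count.

```python
KIB = 1024


def kv_bytes_per_token(kv_heads: int, head_dim: int, layers: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache per token; the leading 2 counts keys and values."""
    return 2 * kv_heads * head_dim * layers * bytes_per_elem


def max_cached_tokens(free_bytes: float, bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in the remaining GPU memory."""
    return int(free_bytes // bytes_per_token)


FROM_FACTORS = kv_bytes_per_token(kv_heads=2, head_dim=128, layers=24,
                                  bytes_per_elem=2)
QUOTED = 48 * KIB        # per-token figure quoted in the text
FREE_BYTES = 13e9        # ~13 GB left after weights on a 16 GB card
CONTEXT_PER_REQUEST = 4096  # hypothetical prompt + output length

for label, per_token in (("from factors", FROM_FACTORS), ("quoted", QUOTED)):
    tokens = max_cached_tokens(FREE_BYTES, per_token)
    requests = tokens // CONTEXT_PER_REQUEST
    print(f"{label}: {per_token // KIB} KiB/token, "
          f"{tokens:,} tokens, {requests} concurrent 4K requests")
```

Either way the cache holds several hundred thousand tokens, so memory is unlikely to be what limits the batch sizes mentioned above.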