Question about fine-tuning / inference speed

#13
by piterpetro - opened

Hello everyone,

I am currently testing this model and would like to ask a couple of questions:

  1. What are the recommended hyperparameters for fine-tuning this architecture?
  2. Has anyone measured the average inference latency on a single T4 or V100 GPU?

Thank you for your help and for sharing this model!

I am experiencing the exact same issue with this model.
Could you please share if you found any workaround or solution for this?
Thanks!

Here's the setup we used for a long-CoT SFT of the ~1B MiniCPM5 model. The settings are framework-agnostic:

Batching

  • micro-batch size: 1
  • global batch size: 128

Sequence / positions

  • sequence length: 65536 (64K)
  • max position embeddings: 131072 (128K)
  • RoPE base (theta): 5,000,000

LR & schedule

  • peak LR: 5.22e-5, min LR: 5.22e-6
  • schedule: WSD (warmup–stable–decay)
  • warmup: 250 iters; decay: exponential over the last 2750 of 3000 total iters
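
To make the schedule concrete, here is a minimal sketch of a WSD learning-rate function using the numbers above. The function name `wsd_lr` is made up for illustration, and two shapes are assumptions on my part rather than something stated in the post: the warmup is taken to be linear, and "exponential decay" is read as a geometric interpolation from peak LR to min LR. Taken literally, 250 warmup + 2750 decay = 3000 iters, which leaves a stable phase of zero length; the function handles that case as well as a longer stable phase.

```python
def wsd_lr(step, peak_lr=5.22e-5, min_lr=5.22e-6,
           warmup_iters=250, decay_iters=2750, total_iters=3000):
    """Learning rate at a 0-based step under a warmup-stable-decay schedule."""
    decay_start = total_iters - decay_iters
    if step < warmup_iters:
        # Assumed linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_iters
    if step < decay_start:
        # Stable phase (empty when warmup_iters + decay_iters == total_iters).
        return peak_lr
    # Assumed exponential decay: geometric interpolation from peak to min.
    progress = min(1.0, (step - decay_start) / decay_iters)
    return peak_lr * (min_lr / peak_lr) ** progress


if __name__ == "__main__":
    for s in (0, 249, 250, 1625, 3000):
        print(s, f"{wsd_lr(s):.3e}")
```

Most trainers take a callable like this (or its ratio to the peak LR) as a custom scheduler, but check how your framework counts steps before reusing the off-by-one convention here.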

Parallelism (cluster-dependent)

  • tensor parallel: 1
  • context parallel: 4 (needed to fit the 64K sequence in memory; you can drop it at shorter sequence length)
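
As a rough illustration of how the batching and parallelism numbers fit together, the sketch below derives the data-parallel size and gradient-accumulation steps from a GPU count. It assumes the common Megatron-style convention that data-parallel size = world size / (tensor parallel × context parallel); `batch_layout` and the 32-GPU example are hypothetical, and the post does not say how many GPUs were actually used.

```python
def batch_layout(world_size, global_batch=128, micro_batch=1,
                 tensor_parallel=1, context_parallel=4, seq_len=65536):
    """Derive data-parallel size and grad-accumulation steps (assumed convention)."""
    model_parallel = tensor_parallel * context_parallel
    if world_size % model_parallel:
        raise ValueError("world_size must be divisible by TP * CP")
    data_parallel = world_size // model_parallel
    samples_per_micro_step = micro_batch * data_parallel
    if global_batch % samples_per_micro_step:
        raise ValueError("global_batch must be divisible by micro_batch * DP")
    return {
        "data_parallel": data_parallel,
        "grad_accum_steps": global_batch // samples_per_micro_step,
        # Upper bound: assumes every sequence is packed to the full length.
        "max_tokens_per_optimizer_step": global_batch * seq_len,
    }


if __name__ == "__main__":
    print(batch_layout(world_size=32))  # 32 GPUs is a hypothetical cluster size
```

If you shorten the sequence and drop context parallel to 1, the same GPU count gives a larger data-parallel size and fewer accumulation steps for the same global batch.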

Mode

  • continued fine-tune from the released checkpoint (weights only; data/sampler state reset)

For Q2 (inference latency on a single T4 / V100, served with vLLM or SGLang): I don't have measured numbers, but the architecture is plain Llama (1.08B total / ~0.68B non-embedding params, GQA 16:2, bf16), so here's a first-principles estimate. vLLM and SGLang should behave essentially the same here, since both use a paged KV cache and continuous batching; the limits below come from the hardware, not the framework.

Single request, batch = 1 (latency-oriented)

Decode is memory-bandwidth-bound: the engine reads ~1.8 GB of weights per token (transformer blocks + the untied lm_head). At a realistic ~70% bandwidth utilization:

  • V100 (~900 GB/s HBM2): ~3 ms/token → roughly 300–350 tok/s
  • T4 (~320 GB/s GDDR6): ~8 ms/token → roughly 110–130 tok/s
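
The arithmetic behind those two bullets can be reproduced with a few lines. This is only the bandwidth-bound approximation described above (time per token = bytes read / effective bandwidth); `decode_estimate` is an illustrative name, and the sketch ignores KV-cache reads and kernel launch overhead.

```python
def decode_estimate(bandwidth_gb_s, weights_read_gb=1.8, utilization=0.7):
    """Return (ms per token, tokens per second) for batch-1 decode."""
    effective_gb_s = bandwidth_gb_s * utilization
    ms_per_token = weights_read_gb / effective_gb_s * 1000.0
    return ms_per_token, 1000.0 / ms_per_token


if __name__ == "__main__":
    for name, bandwidth in (("V100", 900), ("T4", 320)):
        ms, tok_s = decode_estimate(bandwidth)
        print(f"{name}: {ms:.1f} ms/token, {tok_s:.0f} tok/s")
```

At 70% utilization this lands near 350 tok/s for the V100 and near 124 tok/s for the T4, the upper end of the quoted ranges; lowering `utilization` shows how sensitive the estimate is to that one assumption.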

Prefill / TTFT (~512-token prompt, ~0.7 TFLOP) is compute-bound, at ~50% MFU:

  • V100 (~125 TFLOPS FP16): ~12–20 ms
  • T4 (~65 TFLOPS FP16): ~22–40 ms
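
For the prefill side, a quick sketch of the same kind: it uses the usual 2 × parameters × tokens approximation for forward-pass FLOPs, applied to the ~0.68B non-embedding parameters, and divides by peak throughput times MFU. The function name `prefill_ms` is hypothetical. The sketch leaves out attention-score FLOPs and the lm_head, so it comes out at or just under the low end of the ranges above.

```python
def prefill_ms(prompt_tokens, peak_tflops, non_embedding_params=0.68e9, mfu=0.5):
    """Compute-bound estimate of prefill time in milliseconds."""
    # Approximation: ~2 FLOPs per parameter per token for a forward pass.
    flops = 2.0 * non_embedding_params * prompt_tokens
    achieved_flops_per_s = peak_tflops * 1e12 * mfu
    return flops / achieved_flops_per_s * 1000.0


if __name__ == "__main__":
    for name, tflops in (("V100", 125), ("T4", 65)):
        print(f"{name}: {prefill_ms(512, tflops):.1f} ms for a 512-token prompt")
```

Real TTFT also includes scheduling and tokenization overhead, so treat this as a floor rather than a prediction.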

Many requests, continuous batching (throughput-oriented)

This is where both engines win. KV cache is tiny thanks to GQA: about 24 KB/token (2 for K and V × 2 kv-heads × 128 head dim × 24 layers × 2 B), so weights (2.16 GB) leave ~13 GB on a 16 GB card → room for several hundred thousand cached tokens, i.e. large batches. Aggregate decode throughput scales with batch size until you become compute-bound:

  • V100: expect ~2,000–4,000 tok/s aggregate at batch ~32–64
  • T4: expect ~600–1,200 tok/s aggregate at similar batch
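
To sanity-check the KV-cache budget, here is a small sketch that multiplies out the factors quoted above, counting both K and V (which comes to 24 KiB per token), and converts the ~13 GB of free memory into a token count. The function names are illustrative, and the 13 GB figure is the post's rough estimate, not a measured value.

```python
def kv_bytes_per_token(kv_heads=2, head_dim=128, layers=24, bytes_per_elem=2):
    """Bytes of KV cache per token; the leading 2 counts both K and V."""
    return 2 * kv_heads * head_dim * layers * bytes_per_elem


def cacheable_tokens(free_gb=13.0, **model_kwargs):
    """How many tokens of KV cache fit in the given free memory (decimal GB)."""
    return int(free_gb * 1e9 // kv_bytes_per_token(**model_kwargs))


if __name__ == "__main__":
    per_token = kv_bytes_per_token()
    print(f"{per_token} bytes/token ({per_token / 1024:.0f} KiB)")
    print(f"{cacheable_tokens():,} tokens fit in 13 GB")
```

In practice the serving engine reserves only a fraction of GPU memory for the cache (in vLLM this is governed by `gpu_memory_utilization`), so the usable token count will be somewhat lower.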
suhmily changed discussion status to closed
