LFM2.5-230M-ONNX
ONNX export of LFM2.5-230M for cross-platform inference.
LFM2.5 is a hybrid architecture combining multiplicative gates and short convolutions, optimized for edge deployment with fast inference on CPU, GPU, and NPU hardware.
Recommended Variants
| Precision | Size | Platform | Use Case |
|---|---|---|---|
| Q4 | ~200 MB | WebGPU, Server | Recommended for most uses (quantized embedding) |
| Q4F32 | ~390 MB | Server (CPU/GPU) | Q4 weights with FP32 embedding; higher quality |
| FP16 | ~455 MB | WebGPU, Server | Higher quality |
| Q8 | ~470 MB | Server only | Balance of quality and size |
- WebGPU: Use Q4 or FP16 (Q4F32 and Q8 are not supported on WebGPU).
- Server (CPU/GPU): All variants supported. Q4F32 keeps the embedding in FP32 for higher fidelity.
The tied embedding / LM head is kept in FP32 across all quantized builds.
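The platform guidance above can be written down as a small lookup. The following is only a sketch: the `Variant` class and the `candidates` helper are hypothetical names introduced for illustration, and the sizes are the approximate figures from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    precision: str
    approx_size_mb: int  # approximate size from the table above
    webgpu: bool         # True if the table lists WebGPU as a platform


# Values copied from the "Recommended Variants" table.
VARIANTS = [
    Variant("Q4", 200, True),
    Variant("Q4F32", 390, False),
    Variant("FP16", 455, True),
    Variant("Q8", 470, False),
]


def candidates(platform: str, max_size_mb: Optional[int] = None) -> List[str]:
    """Return precisions usable on `platform` ("webgpu" or "server"),
    optionally limited to an approximate size budget in MB."""
    if platform not in ("webgpu", "server"):
        raise ValueError(f"unknown platform: {platform!r}")
    result = []
    for v in VARIANTS:
        # Every variant runs on a server; only some run on WebGPU.
        if platform == "webgpu" and not v.webgpu:
            continue
        if max_size_mb is not None and v.approx_size_mb > max_size_mb:
            continue
        result.append(v.precision)
    return result


print(candidates("webgpu"))       # precisions usable in the browser
print(candidates("server", 400))  # server variants of roughly 400 MB or less
```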
Model Files
onnx/
├── model.onnx        # FP32
├── model_fp16.onnx   # FP16
├── model_q4.onnx     # Q4, quantized embedding (WebGPU)
├── model_q4f32.onnx  # Q4 weights, FP32 embedding (server)
└── model_q8.onnx     # Q8
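As a rough illustration of this layout, the sketch below maps a precision name to the file paths to fetch. The `repo_files` helper is a hypothetical name. It also assumes that each graph file has a companion `<name>.onnx_data` weights file; this card only shows that for `model_q4f32.onnx`, so check the repository file list for the other variants.

```python
from typing import List

# File names taken from the tree above, keyed by lowercase precision.
MODEL_FILES = {
    "fp32": "model.onnx",
    "fp16": "model_fp16.onnx",
    "q4": "model_q4.onnx",
    "q4f32": "model_q4f32.onnx",
    "q8": "model_q8.onnx",
}


def repo_files(precision: str, with_external_data: bool = True) -> List[str]:
    """Return repository-relative paths for one precision.

    Assumption: external weights live next to the graph as
    "<file>_data". Pass with_external_data=False if a variant
    turns out not to ship such a file.
    """
    key = precision.lower()
    if key not in MODEL_FILES:
        raise ValueError(f"unknown precision: {precision!r}")
    path = f"onnx/{MODEL_FILES[key]}"
    return [path, f"{path}_data"] if with_external_data else [path]


print(repo_files("Q4F32"))
```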
Python (onnxruntime)
pip install onnxruntime transformers numpy huggingface_hub
# or, for GPU:
pip install onnxruntime-gpu transformers numpy huggingface_hub
from huggingface_hub import hf_hub_download
model_id = "LiquidAI/LFM2.5-230M-ONNX"
# Q4F32 recommended for server CPU/GPU; use model_q4.onnx for WebGPU.
hf_hub_download(model_id, "onnx/model_q4f32.onnx")
hf_hub_download(model_id, "onnx/model_q4f32.onnx_data")
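Once the files are downloaded, the model can be opened with an `onnxruntime.InferenceSession`. The sketch below is one possible way to do that, not the official loading code: `pick_providers` and `load_session` are hypothetical helper names, and the provider preference order (CUDA first, then CPU) is an assumption. It prints the input names and shapes the graph declares rather than assuming them.

```python
from typing import List

# Assumed preference order: GPU first if present, CPU as fallback.
PREFERRED = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def pick_providers(available: List[str]) -> List[str]:
    """Keep the preferred providers that this onnxruntime build offers."""
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ["CPUExecutionProvider"]


def load_session(model_path: str):
    """Open the model and print the inputs the graph declares."""
    import onnxruntime as ort  # imported lazily so pick_providers stays dependency-free

    providers = pick_providers(ort.get_available_providers())
    session = ort.InferenceSession(model_path, providers=providers)
    for inp in session.get_inputs():
        print(inp.name, inp.shape, inp.type)
    return session
```

`hf_hub_download` returns the local path of the file it fetched, so a call such as `load_session(hf_hub_download(model_id, "onnx/model_q4f32.onnx"))` should work once the `.onnx_data` file has been downloaded too, since both land in the same cache directory.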
WebGPU (Transformers.js)
import { pipeline } from "@huggingface/transformers";
const generator = await pipeline("text-generation", "LiquidAI/LFM2.5-230M-ONNX", {
device: "webgpu",
dtype: "q4", // or "fp16"
});