LFM2.5-230M-ONNX

ONNX export of LFM2.5-230M for cross-platform inference.

LFM2.5 is a hybrid architecture combining multiplicative gates and short convolutions, optimized for edge deployment with fast inference on CPU, GPU, and NPU hardware.

Recommended Variants

Precision	Size	Platform	Use Case
Q4	~200 MB	WebGPU, Server	Recommended for most uses (quantized embedding)
Q4F32	~390 MB	Server (CPU/GPU)	Q4 weights with FP32 embedding — higher quality
FP16	~455 MB	WebGPU, Server	Higher quality
Q8	~470 MB	Server only	Balance of quality and size

WebGPU: Use Q4 or FP16 (Q4F32 and Q8 are not supported on WebGPU).
Server (CPU/GPU): All variants supported. Q4F32 keeps the embedding in FP32 for higher fidelity.

The tied embedding / LM head is kept in FP32 across all quantized builds.
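
The platform guidance above can be sketched as a small lookup. This is only an illustration: the `Variant` class and the `supported_variants` and `default_variant` helpers are hypothetical names, not part of any library, and the sizes are the approximate figures from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    precision: str
    approx_size_mb: int
    webgpu: bool  # True if the table lists WebGPU as a supported platform


# Values transcribed from the Recommended Variants table (sizes are approximate).
VARIANTS = [
    Variant("q4", 200, True),
    Variant("q4f32", 390, False),
    Variant("fp16", 455, True),
    Variant("q8", 470, False),
]


def supported_variants(platform: str) -> list[str]:
    """Return the precisions usable on 'webgpu' or 'server', smallest first."""
    if platform not in ("webgpu", "server"):
        raise ValueError(f"unknown platform: {platform!r}")
    # Every variant runs on a server; only some are listed for WebGPU.
    usable = [v for v in VARIANTS if platform == "server" or v.webgpu]
    return [v.precision for v in sorted(usable, key=lambda v: v.approx_size_mb)]


def default_variant(platform: str) -> str:
    """Pick the variant this card recommends: Q4 for WebGPU, Q4F32 for server."""
    if platform not in ("webgpu", "server"):
        raise ValueError(f"unknown platform: {platform!r}")
    return "q4" if platform == "webgpu" else "q4f32"
```

For example, `supported_variants("webgpu")` yields only the Q4 and FP16 builds, which matches the note that Q4F32 and Q8 are not supported on WebGPU.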

Model Files

onnx/
├── model.onnx              # FP32
├── model_fp16.onnx         # FP16
├── model_q4.onnx           # Q4, quantized embedding (WebGPU)
├── model_q4f32.onnx        # Q4 weights, FP32 embedding (server)
└── model_q8.onnx           # Q8
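
A minimal sketch of mapping a precision name to the files listed above. The `repo_files` helper is hypothetical. It also assumes that every variant ships an external-weights file named `<graph>_data`; this card only shows that file for the Q4F32 build, so check the repository listing before relying on it for other variants.

```python
# Graph file names taken from the onnx/ listing above.
FILENAMES = {
    "fp32": "onnx/model.onnx",
    "fp16": "onnx/model_fp16.onnx",
    "q4": "onnx/model_q4.onnx",
    "q4f32": "onnx/model_q4f32.onnx",
    "q8": "onnx/model_q8.onnx",
}


def repo_files(precision: str) -> tuple[str, str]:
    """Return (graph_file, external_data_file) for a precision name.

    ASSUMPTION: the external data file is the graph name plus "_data".
    Only the Q4F32 pair is confirmed by this card.
    """
    key = precision.lower()
    if key not in FILENAMES:
        raise KeyError(f"unknown precision: {precision!r}")
    graph = FILENAMES[key]
    return graph, graph + "_data"
```

Both returned names can be passed to `hf_hub_download` one at a time, as in the Python section.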

Python (onnxruntime)

pip install onnxruntime transformers numpy huggingface_hub
# or, for GPU:
pip install onnxruntime-gpu transformers numpy huggingface_hub

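from huggingface_hub import hf_hub_download

model_id = "LiquidAI/LFM2.5-230M-ONNX"
# Q4F32 recommended for server CPU/GPU; use model_q4.onnx for WebGPU.
# hf_hub_download returns the local path of the cached file.
model_path = hf_hub_download(model_id, "onnx/model_q4f32.onnx")
# The external weights file is cached in the same folder as the graph file.
hf_hub_download(model_id, "onnx/model_q4f32.onnx_data")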
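
Once the files are downloaded, the graph can be opened with onnxruntime. The following is a sketch rather than a full generation loop: `pick_providers`, `open_session`, and `describe_inputs` are hypothetical helper names, and the preference for CUDA over CPU is an assumption about a typical server setup. Because this card does not list the model's input names, the sketch reads them from the session instead of hard-coding them.

```python
from typing import Sequence


def pick_providers(available: Sequence[str]) -> list[str]:
    """Prefer CUDA when onnxruntime reports it, and always fall back to CPU."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


def open_session(model_path: str):
    """Create an InferenceSession for a downloaded .onnx file."""
    # Imported lazily so the helpers above work without onnxruntime installed.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


def describe_inputs(session) -> list[tuple]:
    """List (name, type, shape) for each graph input of the session."""
    return [(i.name, i.type, i.shape) for i in session.get_inputs()]
```

Calling `open_session` with the path returned by `hf_hub_download` for the `.onnx` file, then printing `describe_inputs(session)`, shows which tensors the graph expects before any inference code is written.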

WebGPU (Transformers.js)

import { pipeline } from "@huggingface/transformers";

const generator = await pipeline("text-generation", "LiquidAI/LFM2.5-230M-ONNX", {
  device: "webgpu",
  dtype: "q4", // or "fp16"
});
