Liquid AI

LFM2.5-230M-ONNX

ONNX export of LFM2.5-230M for cross-platform inference.

LFM2.5 is a hybrid architecture combining multiplicative gates and short convolutions, optimized for edge deployment with fast inference on CPU, GPU, and NPU hardware.

Recommended Variants

Variant   Size      Description
FP16      ~455 MB   All weights in FP16
Q4        ~200 MB   INT4 embedding (GatherBlockQuantized), INT4 lm_head (MatMulNBits, shared), INT4 MatMul weights
Q4F32     ~390 MB   INT4 MatMul weights, FP32 embedding and norms
Q8        ~470 MB   INT8 MatMul weights, FP32 embedding and norms

Q4 uses GatherBlockQuantized for the token embedding and MatMulNBits for the lm_head, reusing the same quantized weights and scales. All other linear layers are quantized to INT4 via post-export MatMulNBitsQuantizer. Block size is 32.

Q4F32 keeps the embedding as an FP32 Gather and the lm_head as an FP32 Transpose + MatMul. Only the internal linear layers (attention projections, conv projections, MLP) are quantized to INT4 via post-export MatMulNBitsQuantizer.

Q8 has the same structure as Q4F32 but uses INT8 weights (asymmetric quantization).
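
To make the block size of 32 concrete, here is a minimal NumPy sketch of block-wise 4-bit quantization: each run of 32 weights gets its own scale and zero point. It is an illustration only, not the exporter's code. The asymmetric scheme, the helper names (`quantize_int4_blocks`, `dequantize_int4_blocks`), and the unpacked one-value-per-byte storage are assumptions made for readability; the real MatMulNBits and GatherBlockQuantized operators use their own packed layouts.

```python
import numpy as np

BLOCK_SIZE = 32  # block size stated in this card


def quantize_int4_blocks(weights, block_size=BLOCK_SIZE):
    """Asymmetric 4-bit quantization with one scale and zero point per block (assumed scheme)."""
    blocks = weights.reshape(-1, block_size).astype(np.float32)
    w_min = blocks.min(axis=1, keepdims=True)
    w_max = blocks.max(axis=1, keepdims=True)
    # 4 bits give 16 levels (0..15); the floor guards against constant blocks.
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)
    zero_point = np.clip(np.round(-w_min / scale), 0, 15)
    q = np.clip(np.round(blocks / scale + zero_point), 0, 15).astype(np.uint8)
    return q, scale, zero_point


def dequantize_int4_blocks(q, scale, zero_point, shape):
    """Map the 4-bit codes back to approximate float weights."""
    return ((q.astype(np.float32) - zero_point) * scale).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)  # toy weight matrix: 8 blocks of 32
q, scale, zp = quantize_int4_blocks(w)
w_hat = dequantize_int4_blocks(q, scale, zp, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(q.shape, scale.shape, max_err)
```

In this toy setup the reconstruction error stays within about half a quantization step per block, which is why a smaller block size (more scales) trades file size for accuracy.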

Generation Parameters

Parameter            Value
temperature          0.1
top_k                50
repetition_penalty   1.05
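
A temperature of 0.1 makes sampling close to greedy decoding. The sketch below, which assumes made-up toy logits and a hypothetical helper `top_k_probs`, shows how dividing logits by the temperature sharpens the distribution before top-k sampling.

```python
import numpy as np


def top_k_probs(logits, temperature=0.1, top_k=50):
    """Return (indices, probabilities) of the top-k tokens after temperature scaling."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    k = min(top_k, scaled.size)  # toy vocabularies may be smaller than top_k
    idx = np.argpartition(scaled, -k)[-k:]
    z = scaled[idx] - scaled[idx].max()  # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return idx, probs


toy_logits = np.array([2.0, 1.5, 0.5, -1.0])  # illustrative values, not model output
idx, p = top_k_probs(toy_logits)               # recommended temperature 0.1
best = int(idx[np.argmax(p)])
idx_t1, p_t1 = top_k_probs(toy_logits, temperature=1.0)
print(best, float(p.max()), float(p_t1.max()))
```

With these toy logits the most likely token takes over 99% of the probability mass at temperature 0.1, compared with a little over half at temperature 1.0.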

Model Files

onnx/
├── model.onnx              # FP32
├── model_fp16.onnx         # FP16
├── model_q4.onnx           # Q4
├── model_q4f32.onnx        # Q4F32
└── model_q8.onnx           # Q8
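
A small lookup can keep variant names and file paths in sync when switching precisions. The sketch below is a hypothetical helper (`files_for_variant` is not part of any library). It assumes every variant ships a companion `<file>_data` weights file, as the Q4 inference example in this card does for `model_q4.onnx_data`; check the repository listing before relying on that.

```python
# Paths taken from the onnx/ listing in this card.
VARIANT_FILES = {
    "fp32": "onnx/model.onnx",
    "fp16": "onnx/model_fp16.onnx",
    "q4": "onnx/model_q4.onnx",
    "q4f32": "onnx/model_q4f32.onnx",
    "q8": "onnx/model_q8.onnx",
}


def files_for_variant(variant):
    """Return (graph_file, external_data_file) for a variant name such as "q4"."""
    key = variant.lower()
    if key not in VARIANT_FILES:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(VARIANT_FILES)}")
    graph = VARIANT_FILES[key]
    # Assumption: the external weights file is named "<graph file>_data".
    return graph, graph + "_data"


print(files_for_variant("q4"))
```

Both returned paths can then be passed to `hf_hub_download` one at a time so the two files land in the same local directory.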

Python

Installation

pip install onnxruntime transformers numpy huggingface_hub
# or with GPU support:
pip install onnxruntime-gpu transformers numpy huggingface_hub

Inference

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

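# Download the model graph and its external weights file.
# data_path is not passed to ONNX Runtime directly: downloading it places
# model_q4.onnx_data next to model_q4.onnx, where the runtime looks for it.
model_id = "LiquidAI/LFM2.5-230M-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")
data_path = hf_hub_download(model_id, "onnx/model_q4.onnx_data")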

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Sampling parameters
TEMPERATURE = 0.1
TOP_K = 50
REPETITION_PENALTY = 1.05

# Prepare chat input
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = np.array([tokenizer.encode(prompt, add_special_tokens=False)], dtype=np.int64)

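# Initialize the cache inputs (attention KV and conv state) with empty tensors
ONNX_DTYPE = {"tensor(float)": np.float32, "tensor(float16)": np.float16, "tensor(int64)": np.int64}
cache = {}
for inp in session.get_inputs():
    # Everything except the token-level inputs is treated as cache state
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    # Keep static dims; symbolic dims (such as batch) default to 1
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        # A symbolic sequence dim starts at 0: there is no past context yet
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    cache[inp.name] = np.zeros(shape, dtype=ONNX_DTYPE.get(inp.type, np.float32))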

# Check if model uses position_ids
input_names = {inp.name for inp in session.get_inputs()}
use_position_ids = "position_ids" in input_names


def sample_token(logits, generated_tokens):
    """Sample next token with temperature, top-k, and repetition penalty."""
    # Apply repetition penalty
    for token_id in set(generated_tokens):
        if logits[token_id] > 0:
            logits[token_id] /= REPETITION_PENALTY
        else:
            logits[token_id] *= REPETITION_PENALTY

    # Apply temperature
    logits = logits / TEMPERATURE

    # Top-k filtering
    top_k_indices = np.argpartition(logits, -TOP_K)[-TOP_K:]
    top_k_logits = logits[top_k_indices]

    # Softmax over top-k
    top_k_logits -= np.max(top_k_logits)
    probs = np.exp(top_k_logits) / np.sum(np.exp(top_k_logits))

    # Sample
    chosen = np.random.choice(len(top_k_indices), p=probs)
    return int(top_k_indices[chosen])


# Generate tokens
seq_len = input_ids.shape[1]
generated_tokens = []

for step in range(512):  # max tokens
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated_tokens[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated_tokens) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated_tokens)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if use_position_ids:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    logits = outputs[0][0, -1].copy()
    next_token = sample_token(logits, generated_tokens)
    generated_tokens.append(next_token)

    # Update cache
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv").replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated_tokens, skip_special_tokens=True))
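
The session above is created with default execution providers. If `onnxruntime-gpu` is installed, providers can be requested explicitly. The sketch below uses a hypothetical helper, `pick_providers`, that filters a preference list against what the installed runtime reports. The preference order is an assumption, and the commented lines show where the real ONNX Runtime calls would go.

```python
# Assumed preference order: try CUDA first, then fall back to CPU.
PREFERRED = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that the installed runtime actually offers."""
    chosen = [p for p in preferred if p in available]
    # Fall back to CPU when nothing in the preference list is available.
    return chosen or ["CPUExecutionProvider"]


# With onnxruntime imported as ort, usage would look like:
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession(model_path, providers=providers)
print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
print(pick_providers(["CPUExecutionProvider"]))
```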

WebGPU (Browser)

Installation

npm install onnxruntime-web @huggingface/transformers

Enable WebGPU

WebGPU is required for browser inference. To enable:

  1. Chrome/Edge: Navigate to chrome://flags/#enable-unsafe-webgpu, enable, and restart
  2. Verify: Check chrome://gpu for "WebGPU" status
  3. Test: Run navigator.gpu.requestAdapter() in DevTools console

Inference

import * as ort from "onnxruntime-web/webgpu";
import { AutoTokenizer } from "@huggingface/transformers";

// Check WebGPU availability
if (!navigator.gpu) {
  throw new Error("WebGPU not available. Enable at chrome://flags/#enable-unsafe-webgpu");
}
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
  throw new Error("WebGPU adapter not found. Check chrome://gpu for status.");
}

ort.env.wasm.numThreads = 1;

const modelId = "LiquidAI/LFM2.5-230M-ONNX";
const modelBase = `https://huggingface.co/${modelId}/resolve/main`;

// Load tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(modelId);

// Load ONNX session with external data
const onnxPath = `${modelBase}/onnx/model_q4.onnx`;
const dataPath = `${modelBase}/onnx/model_q4.onnx_data`;
const session = await ort.InferenceSession.create(onnxPath, {
  executionProviders: ["webgpu"],
  externalData: [{ path: "model_q4.onnx_data", data: dataPath }],
});

// Sampling parameters
const TEMPERATURE = 0.1;
const TOP_K = 50;
const REPETITION_PENALTY = 1.05;

// Model config (from config.json)
const hiddenSize = 1024;
const numKVHeads = 8;
const headDim = 64;

// Initialize KV cache
function initCache() {
  const cache = {};
  for (const name of session.inputNames) {
    if (name.startsWith("past_conv")) {
      cache[name] = new ort.Tensor("float32", new Float32Array(hiddenSize * 3), [1, hiddenSize, 3]);
    } else if (name.startsWith("past_key_values")) {
      cache[name] = new ort.Tensor("float32", new Float32Array(0), [1, numKVHeads, 0, headDim]);
    }
  }
  return cache;
}

// Update cache from outputs
function updateCache(cache, outputs) {
  for (const [name, tensor] of Object.entries(outputs)) {
    if (name.startsWith("present_conv")) {
      cache[name.replace("present_conv", "past_conv")] = tensor;
    } else if (name.startsWith("present.")) {
      cache[name.replace("present.", "past_key_values.")] = tensor;
    }
  }
}

// Sample next token with temperature, top-k, and repetition penalty
function sampleToken(logitsData, vocabSize, generatedTokens) {
  const logits = new Float32Array(logitsData);

  // Apply repetition penalty
  const seen = new Set(generatedTokens);
  for (const tokenId of seen) {
    if (logits[tokenId] > 0) {
      logits[tokenId] /= REPETITION_PENALTY;
    } else {
      logits[tokenId] *= REPETITION_PENALTY;
    }
  }

  // Apply temperature
  for (let i = 0; i < vocabSize; i++) {
    logits[i] /= TEMPERATURE;
  }

  // Top-k: find top K indices
  const indexed = Array.from(logits.slice(0, vocabSize), (v, i) => [v, i]);
  indexed.sort((a, b) => b[0] - a[0]);
  const topK = indexed.slice(0, TOP_K);

  // Softmax over top-k
  const maxLogit = topK[0][0];
  const exps = topK.map(([v, i]) => [Math.exp(v - maxLogit), i]);
  const sumExp = exps.reduce((s, [e]) => s + e, 0);
  const probs = exps.map(([e, i]) => [e / sumExp, i]);

  // Sample from distribution
  let r = Math.random();
  for (const [p, i] of probs) {
    r -= p;
    if (r <= 0) return i;
  }
  return probs[probs.length - 1][1];
}

// Build prompt and tokenize
const messages = [{ role: "user", content: "What is the capital of France?" }];
const prompt = tokenizer.apply_chat_template(messages, { add_generation_prompt: true, tokenize: false });
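// The chat template already inserts special tokens, so skip adding them again (matches the Python example)
const inputIds = tokenizer.encode(prompt, { add_special_tokens: false });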

// Generation loop
const cache = initCache();
const eosTokenId = tokenizer.eos_token_id;
const generatedTokens = [];
let curLen = inputIds.length;
let ids = inputIds;

for (let step = 0; step < 512; step++) {
  const inputIdsTensor = new ort.Tensor("int64", new BigInt64Array(ids.map(BigInt)), [1, ids.length]);
  const attentionMask = new ort.Tensor("int64", new BigInt64Array(curLen).fill(1n), [1, curLen]);

  const outputs = await session.run({ input_ids: inputIdsTensor, attention_mask: attentionMask, ...cache });

  const logits = outputs.logits;
  const vocabSize = logits.dims[2];
  const lastLogits = logits.data.slice((logits.dims[1] - 1) * vocabSize, logits.dims[1] * vocabSize);
  const nextToken = sampleToken(lastLogits, vocabSize, generatedTokens);

  generatedTokens.push(nextToken);
  if (nextToken === eosTokenId) break;

  updateCache(cache, outputs);
  ids = [nextToken];
  curLen++;
}

console.log(tokenizer.decode(generatedTokens, { skip_special_tokens: true }));

WebGPU Notes

  • Models use external data files (.onnx_data); when calling onnxruntime-web directly, pass the file through the externalData session option, as in the example above
  • int64 tensors require BigInt64Array

Building from Source

This ONNX package was created using the official Liquid4All/onnx-export tool (the reference exporter for all LFM2/LFM2.5 ONNX checkpoints).

git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

# Export all precisions (fp16, q4, q4f32, q8) + FP32 base
uv run lfm2-export LiquidAI/LFM2.5-230M --precision

The output lands in exports/LFM2.5-230M-ONNX/. Copy its contents (or the onnx/ subdir + metadata) to your target repo root for publishing to the Hugging Face Hub.

The builder manually constructs the ONNX graph (using onnx.helper) for maximum compatibility with ONNX Runtime WebGPU, Transformers.js, and Microsoft custom ops (SimplifiedLayerNormalization, RotaryEmbedding, GroupQueryAttention, GatherBlockQuantized, MatMulNBits, etc.). Standard torch.onnx.export is not used.

License

This model is released under the LFM 1.0 License.
