Ling-2.6-flash-MXFP4

~103B-A8B hybrid MoE — 63 GB on disk (down from the 200 GB bf16 source) — stock 4-bit affine quantization on inclusionAI's Bailing-V2.5 hybrid architecture. Loads via mlx_lm.load() with the bailing_hybrid model class — no TurboQuant runtime, no sidecar required.

  • Source: inclusionAI/Ling-2.6-flash (Ant Group's Bailing-V2.5 hybrid: 32 layers MLA + Lightning-Linear-Attention, 256 experts top-8, MTP head, 131K context)
  • Quantization: MXFP4 — every weight (routed experts, attention, shared experts, dense MLP, embed, lm_head) at 4-bit affine, group_size=32. Norms, router gates, expert biases, and slopes stay fp16/fp32 passthrough (see the conversion sketch after this list).
  • Bundle size: 63 GB on-disk across 51 shards
  • Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
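
The exact export recipe isn't published in this card, but a comparable 4-bit, group-size-32 conversion can be sketched with stock mlx-lm's converter, which by default only quantizes linear/embedding weights and so leaves norms, gates, biases, and slopes in full precision, consistent with the passthrough described above. The output path below is a placeholder.

# Sketch only: assumes the stock mlx_lm converter was sufficient; the actual
# bundle may have used a custom recipe. mlx_path is a placeholder.
from mlx_lm import convert

convert(
    hf_path="inclusionAI/Ling-2.6-flash",
    mlx_path="Ling-2.6-flash-MXFP4",
    quantize=True,
    q_bits=4,
    q_group_size=32,
)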

Why two variants?

|                             | JANGTQ2                                      | MXFP4                 |
|-----------------------------|----------------------------------------------|-----------------------|
| Routed experts              | 2-bit MXTQ codebook (Hadamard + Lloyd-Max)   | 4-bit affine          |
| Attention / shared / dense  | 8-bit affine                                 | 4-bit affine          |
| Bundle size                 | 30 GB                                        | 63 GB                 |
| Quality                     | tighter (8-bit attention)                    | uniform 4-bit         |
| Loader                      | jang_tools.load_jangtq (TurboQuant kernel)   | stock mlx_lm.load()   |
| Sidecar                     | required                                     | not needed            |
| Min RAM                     | 64 GB                                        | 96 GB                 |

JANGTQ2 accepts a cheap-but-slow MXTQ codec on the routed experts in exchange for a tighter overall bit budget. MXFP4 is the simpler "just works with stock MLX" option for users who don't want the TurboQuant runtime in their stack.

Architecture (bailing_hybrid)

Hybrid attention — every 8th layer is full softmax MLA; the other 28 of 32 are Lightning-Linear-Attention. A Multi-Token Prediction (MTP) head sits on top.

| Layer block                      | Count | Attention          | MLP                            |
|----------------------------------|-------|--------------------|--------------------------------|
| Layer 0                          | 1     | Linear (GLA)       | Dense MLP (intermediate=9216)  |
| Layers 1–6, 8–14, 16–22, 24–30   | 27    | Linear (GLA)       | MoE (256+1)                    |
| Layers 7, 15, 23, 31             | 4     | MLA (full softmax) | MoE (256+1)                    |
| MTP head (32)                    | 1     | MLA                | MoE (256+1)                    |
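
A minimal sketch of the layer schedule implied by the table — illustrative only; the real implementation lives in mlx_lm/models/bailing_hybrid.py and may encode this differently:

# Illustrative layer map for the 32 decoder layers; not the actual model code.
def layer_kind(i: int) -> tuple[str, str]:
    attn = "MLA (full softmax)" if (i + 1) % 8 == 0 else "Linear (GLA)"
    mlp = "Dense MLP" if i == 0 else "MoE (256 routed + 1 shared, top-8)"
    return attn, mlp

for i in range(32):
    print(i, *layer_kind(i))
# The MTP head (index 32) uses MLA attention plus MoE, per the table above.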

See the JANGTQ variant card for the deeper architecture writeup.

Loading (Python)

pip install mlx-lm jang-tools   # jang-tools ships the bailing_hybrid model class

from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
print(generate(model, tokenizer, prompt="Hello, Ling!", max_tokens=64))

Stock mlx_lm.load() works once mlx_lm/models/bailing_hybrid.py is present (shipped with jang-tools >= TBD). The bundle's configuration_bailing_moe_v2_5.py and modeling_bailing_moe_v2_5.py provide HF compatibility for tooling that goes through transformers.
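
If load() fails with an unknown model-type error, a quick check (assuming the class is importable as mlx_lm.models.bailing_hybrid once jang-tools has installed it) is:

# Assumption: jang-tools drops bailing_hybrid.py into mlx_lm/models/.
import importlib.util

if importlib.util.find_spec("mlx_lm.models.bailing_hybrid") is None:
    print("bailing_hybrid not found; install or upgrade jang-tools")
else:
    print("bailing_hybrid model class is available")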

Reasoning + tools

Detailed thinking is off by default. To enable it:

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user",   "content": "..."},
]

The model emits <think>...</think> reasoning blocks before its answer when thinking is on. Tool calls follow the DeepSeek-style format.
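
A minimal end-to-end sketch with thinking enabled, assuming the bundled chat template implements the system-message toggle described above; the user prompt is just an example:

# Sketch: enable detailed thinking via the system message, then generate.
from mlx_lm import load, generate

model, tokenizer = load("OsaurusAI/Ling-2.6-flash-MXFP4")
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "How many prime numbers are below 50?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(text)  # expect a <think>...</think> block before the final answer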

Credits

  • Quantization + MLX runtime: Jinho Jang (eric@osaurus.ai)
  • Base model: inclusionAI — Ant Group's Bailing team
  • Architecture references: Lightning-Attention-2 (arXiv:2401.04658), DeepSeek-V2/V3 MLA, MTP from DeepSeek-V3
  • Osaurus: osaurus.ai — Apple-Silicon-first inference for open-weight LLMs.