Qwen3.6-35B-A3B — Evolved Mixed-Bit (MLX, 12.63 GB)

TL;DR: 12.63 GB MLX quant of Qwen3.6-35B-A3B. 16.8% smaller than the public TurboQuant-3bit MLX quant at statistically tied WikiText-2 perplexity. Loads with stock mlx-lm, no custom loader.

A per-module mixed-precision quantization of Qwen/Qwen3.6-35B-A3B for Apple Silicon / MLX. Instead of one uniform setting, the bit-width and group size of every quantized tensor were chosen by an evolutionary search (OpenEvolve / MAP-Elites) that minimizes memory under a perplexity budget.

Why this exists

Apple Silicon local-LLM users usually get a bad tradeoff:

  • 4-bit works, but costs memory
  • 3-bit saves memory, but quality can fall off
  • clever custom quant recipes often need custom loaders

This release is the boring-useful point: smaller than the known 3-bit quant, same measured perplexity, and still just mlx_lm.load(...).

Results

Perplexity on WikiText-2-raw test (~16,260 tokens, identical eval pipeline for every model, including the real public TurboQuant model). Lower is better.

| Model | Size | Perplexity ↓ | vs TurboQuant-3bit |
|---|---|---|---|
| Qwen3.6-35B-A3B-4bit (base) | 19.51 GB | 4.8308 | |
| Qwen3.6-35B-A3B-TurboQuant-MLX-3bit | 15.18 GB | 5.6743 | |
| this model (evolved) | 12.63 GB | 5.6673 | −16.8% size, −0.007 ppl (tied) |

≈ 2.6 effective bits/weight. The Δppl vs TurboQuant (−0.007) is within measurement noise, so read this as "smaller at equal quality," not a quality win.
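The two headline deltas can be re-derived from the table. This is only a small arithmetic check using the figures reported above; the variable names are illustrative.

```python
# Figures copied from the results table above.
turboquant_gb, evolved_gb = 15.18, 12.63
turboquant_ppl, evolved_ppl = 5.6743, 5.6673

# Relative size reduction versus the TurboQuant-3bit quant, in percent.
size_reduction_pct = (turboquant_gb - evolved_gb) / turboquant_gb * 100

# Perplexity difference (negative means this model scored lower).
delta_ppl = evolved_ppl - turboquant_ppl

print(f"size reduction: {size_reduction_pct:.1f}%")
print(f"delta ppl: {delta_ppl:+.4f}")
```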

Usage

uv tool install mlx-lm
mlx_lm.generate --model inductiveML/Qwen3.6-35B-A3B-evolved-mxbit \
  --prompt "Write a short Python function to parse a CSV file."

Or from Python:

from mlx_lm import load, generate
model, tokenizer = load("inductiveML/Qwen3.6-35B-A3B-evolved-mxbit")
print(generate(model, tokenizer, prompt="The capital of France is", max_tokens=64))

Quantization recipe

The fused routed experts (mlp.switch_mlp.*) carry ~90% of the weight bytes, so the search spends its budget there and protects everything small and sensitive:

| Component | Precision |
|---|---|
| Routed experts: down_proj (all layers) | 3-bit |
| Routed experts: gate/up_proj, layers < 9 | 3-bit |
| Routed experts: gate/up_proj, layers ≥ 9 | 2-bit |
| MoE routers (mlp.gate, shared_expert_gate) | 8-bit (base, unchanged) |
| Embeddings, lm_head, attention, shared expert | 4-bit (base, unchanged) |

Deeper experts also use group_size=128 (vs 64) to shave overhead. The exact per-module policy is in the config.json quantization field (512 entries).
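To see why a larger group size shaves overhead, here is a rough sketch of the storage cost per weight under group quantization. It assumes each group stores one 16-bit scale and one 16-bit bias alongside the packed weights; that layout is an assumption made for illustration and is not stated in this card.

```python
def effective_bits(bits: int, group_size: int, overhead_bits_per_group: int = 32) -> float:
    """Bits stored per weight, counting per-group metadata.

    Assumption: every group carries a 16-bit scale and a 16-bit bias
    (32 bits total), amortized over `group_size` weights.
    """
    return bits + overhead_bits_per_group / group_size


# Compare the two group sizes mentioned above at 2-bit and 3-bit.
for bits, group_size in [(2, 64), (2, 128), (3, 64), (3, 128)]:
    print(f"{bits}-bit, group_size={group_size}: "
          f"{effective_bits(bits, group_size):.2f} bits/weight")
```

Under that assumption, moving from group_size=64 to 128 halves the metadata cost per weight (0.5 to 0.25 bits), which matters most for the 2-bit tensors, where metadata is a larger share of the total.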

How it was made

The quantization "recipe" is a function choose_bits(path, info) → {bits, group_size}, evolved with OpenEvolve (MAP-Elites quality-diversity search). Objective: minimize total bytes subject to perplexity no worse than the public TurboQuant-3bit quant. Candidates are scored on real measured bytes and perplexity, and the frontier was re-validated on a held-out test corpus the search never saw.
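As an illustration of what such a function can look like, here is a hand-written sketch that mirrors the precision table in the recipe section. It is not the evolved function itself. The module path format (for example model.layers.3.mlp.switch_mlp.gate_proj) and the constant group_size=64 are assumptions; the switch to group_size=128 for deeper experts is not modeled because the card does not give its threshold, and the info argument is unused here.

```python
import re

# Hypothetical: matches the layer index in a path like "model.layers.12.mlp...".
_LAYER_RE = re.compile(r"\blayers\.(\d+)\.")


def choose_bits(path, info=None):
    """Illustrative policy mirroring the recipe table, not the released recipe.

    Returns a dict with the bit-width and group size for one module path.
    """
    group_size = 64  # assumption: the deeper-layer switch to 128 is not modeled
    match = _LAYER_RE.search(path)
    layer = int(match.group(1)) if match else None

    # Routed experts carry most of the bytes, so they get the lowest precision.
    if "mlp.switch_mlp." in path:
        if path.endswith("down_proj"):
            return {"bits": 3, "group_size": group_size}
        if layer is not None and layer < 9:
            return {"bits": 3, "group_size": group_size}
        return {"bits": 2, "group_size": group_size}

    # MoE routers stay at the base model's 8-bit setting.
    if path.endswith("mlp.gate") or path.endswith("shared_expert_gate"):
        return {"bits": 8, "group_size": group_size}

    # Embeddings, lm_head, attention, and the shared expert stay at 4-bit.
    return {"bits": 4, "group_size": group_size}


print(choose_bits("model.layers.3.mlp.switch_mlp.gate_proj"))
print(choose_bits("model.layers.20.mlp.switch_mlp.up_proj"))
print(choose_bits("model.layers.20.mlp.gate"))
```

The authoritative per-module settings remain the ones stored in the config.json quantization field.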

Evaluation & honesty

  • This is not a general "better model" claim. It is a memory/quality tradeoff claim, measured on the same WikiText-2 eval pipeline against the public TurboQuant model.
  • All perplexities above were measured with the same code on the same ~16k-token WikiText-2 test split, including the real TurboQuant model, so the comparison is apples-to-apples rather than built from cross-paper numbers.
  • The size win (16.8%) is exact (real stored bytes). The quality difference (−0.007 ppl) is within noise → "smaller at equal quality."
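For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows only that definition; the actual eval pipeline (tokenization, context windowing, stride) is not described in this card, and the function name is illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities of each target token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```

Because the mean is taken over only ~16k tokens, small differences such as −0.007 are best treated as ties, as stated above.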

License

Apache-2.0, inherited from the base model Qwen/Qwen3.6-35B-A3B.
