
Libraries

How to use inductiveML/Qwen3.6-35B-A3B-evolved-mxbit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("inductiveML/Qwen3.6-35B-A3B-evolved-mxbit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi
How to use inductiveML/Qwen3.6-35B-A3B-evolved-mxbit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use inductiveML/Qwen3.6-35B-A3B-evolved-mxbit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default inductiveML/Qwen3.6-35B-A3B-evolved-mxbit

Run Hermes

hermes

MLX LM

How to use inductiveML/Qwen3.6-35B-A3B-evolved-mxbit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
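The same request can be sent from Python. The sketch below uses only the standard library and assumes the server is reachable at http://localhost:8080/v1 (the base URL used in the Pi and Hermes configurations above) and that it returns the usual OpenAI chat-completions response shape. The helper names `build_chat_request` and `chat`, and the `max_tokens` default, are illustrative choices, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"


def build_chat_request(prompt, model=MODEL_ID,
                       base_url="http://localhost:8080/v1", max_tokens=128):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires mlx_lm.server to be running):
# print(chat("Hello"))
```

Building the request separately from sending it makes it easy to inspect the URL and payload before any network call is made.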

Qwen3.6-35B-A3B — Evolved Mixed-Bit (MLX, 12.63 GB)

TL;DR: 12.63 GB MLX quant of Qwen3.6-35B-A3B. 16.8% smaller than the public TurboQuant-3bit MLX quant at statistically tied WikiText-2 perplexity. Loads with stock mlx-lm, no custom loader.

A per-module mixed-precision quantization of Qwen/Qwen3.6-35B-A3B for Apple Silicon / MLX. Instead of one uniform setting, the bit-width and group size of every quantized tensor were chosen by an evolutionary search (OpenEvolve / MAP-Elites) that minimizes memory under a perplexity budget.

Why this exists

Apple Silicon local-LLM users usually get a bad tradeoff:

- 4-bit works, but costs memory
- 3-bit saves memory, but quality can fall off
- clever custom quant recipes often need custom loaders

This release is the boring-useful point: smaller than the known 3-bit quant, same measured perplexity, and still just mlx_lm.load(...).

Results

Perplexity on WikiText-2-raw test (~16,260 tokens, identical eval pipeline for every model, including the real public TurboQuant model). Lower is better.

| Model | Size | Perplexity ↓ | vs TurboQuant-3bit |
|---|---|---|---|
| Qwen3.6-35B-A3B-4bit (base) | 19.51 GB | 4.8308 | - |
| Qwen3.6-35B-A3B-TurboQuant-MLX-3bit | 15.18 GB | 5.6743 | - |
| this model (evolved) | 12.63 GB | 5.6673 | −16.8% size, −0.007 ppl (tied) |

≈ 2.6 effective bits/weight. The Δppl vs TurboQuant (−0.007) is within measurement noise, so read this as "smaller at equal quality," not a quality win.
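As a quick check of the headline numbers, the comparison against TurboQuant-3bit can be recomputed from the sizes and perplexities in the results table. This is plain arithmetic on the published figures; the dictionary keys and the `compare` helper are illustrative names only.

```python
# Sizes (GB) and WikiText-2 perplexities copied from the results table.
RESULTS = {
    "4bit_base": (19.51, 4.8308),
    "turboquant_3bit": (15.18, 5.6743),
    "evolved": (12.63, 5.6673),
}


def compare(model, reference, results=RESULTS):
    """Return (relative size change, absolute perplexity change) vs a reference."""
    size, ppl = results[model]
    ref_size, ref_ppl = results[reference]
    return (size - ref_size) / ref_size, ppl - ref_ppl


size_delta, ppl_delta = compare("evolved", "turboquant_3bit")
print(f"{size_delta:+.1%} size, {ppl_delta:+.4f} ppl")  # -16.8% size, -0.0070 ppl
```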

Usage

uv tool install mlx-lm
mlx_lm.generate --model inductiveML/Qwen3.6-35B-A3B-evolved-mxbit \
  --prompt "Write a short Python function to parse a CSV file."

Or from Python:

from mlx_lm import load, generate
model, tokenizer = load("inductiveML/Qwen3.6-35B-A3B-evolved-mxbit")
print(generate(model, tokenizer, prompt="The capital of France is", max_tokens=64))

Quantization recipe

The fused routed experts (mlp.switch_mlp.*) carry ~90% of the weight bytes, so the search spends its budget there and protects everything small and sensitive:

| Component | Precision |
|---|---|
| Routed experts: `down_proj` (all layers) | 3-bit |
| Routed experts: `gate/up_proj`, layers < 9 | 3-bit |
| Routed experts: `gate/up_proj`, layers ≥ 9 | 2-bit |
| MoE routers (`mlp.gate`, `shared_expert_gate`) | 8-bit (base, unchanged) |
| Embeddings, `lm_head`, attention, shared expert | 4-bit (base, unchanged) |

Deeper experts also use group_size=128 (vs 64) to shave overhead. The exact per-module policy is in the config.json quantization field (512 entries).
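To see how the policy is distributed, the per-module entries can be tallied by (bits, group_size). The sketch below assumes the quantization field maps module paths to dicts with `bits` and `group_size` keys, alongside top-level default `bits` / `group_size` integers; that layout, the `summarize_quantization` helper, and the toy paths are assumptions for illustration, so check the actual config.json before relying on it.

```python
from collections import Counter


def summarize_quantization(quant_cfg):
    """Count per-module entries by (bits, group_size).

    Non-dict values (for example top-level integer defaults) are skipped.
    """
    counts = Counter()
    for path, spec in quant_cfg.items():
        if isinstance(spec, dict):
            counts[(spec.get("bits"), spec.get("group_size"))] += 1
    return counts


# Toy example with made-up module paths, not the real 512-entry policy.
toy_cfg = {
    "group_size": 64,
    "bits": 4,
    "model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
    "model.layers.0.mlp.switch_mlp.down_proj": {"group_size": 64, "bits": 3},
    "model.layers.9.mlp.switch_mlp.up_proj": {"group_size": 128, "bits": 2},
    "model.layers.10.mlp.switch_mlp.up_proj": {"group_size": 128, "bits": 2},
}

for (bits, group_size), n in sorted(summarize_quantization(toy_cfg).items()):
    print(f"{bits}-bit, group_size={group_size}: {n} modules")
```

For the real model, the input would be `json.load(open("config.json"))["quantization"]` from a local copy of the repository.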

How it was made

The quantization "recipe" is a function choose_bits(path, info) → {bits, group_size} evolved with OpenEvolve (MAP-Elites quality-diversity search). Objective: minimize total bytes, subject to perplexity degradation no greater than that of the public TurboQuant-3bit quant. Candidates were scored on real measured bytes and perplexity, and the resulting frontier was re-validated on a held-out test corpus that the search never saw.
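For illustration, a hand-written function with the same `choose_bits(path, info)` shape that reproduces the precision table from the recipe section might look like the sketch below. It is not the evolved function. The module path format (`model.layers.N...`), the group size of 64 for every non-deep module, and the rule that `group_size=128` applies to `gate/up_proj` from layer 9 onward are all assumptions; the authoritative policy is the one stored in config.json.

```python
import re

LAYER_RE = re.compile(r"layers\.(\d+)\.")


def choose_bits(path, info=None, shallow_layers=9, deep_group_size=128):
    """Map a module path to {"bits", "group_size"} following the recipe table.

    `info` is accepted for signature compatibility and unused here.
    `shallow_layers` and `deep_group_size` are hypothetical knobs.
    """
    match = LAYER_RE.search(path)
    layer = int(match.group(1)) if match else -1

    # MoE routers stay at the base model's 8-bit setting.
    if path.endswith(("mlp.gate", "mlp.shared_expert_gate")):
        return {"bits": 8, "group_size": 64}

    # Fused routed experts carry most of the bytes, so they go lowest.
    if ".switch_mlp." in path:
        if path.endswith("down_proj") or layer < shallow_layers:
            return {"bits": 3, "group_size": 64}
        return {"bits": 2, "group_size": deep_group_size}

    # Embeddings, lm_head, attention, shared expert: base 4-bit.
    return {"bits": 4, "group_size": 64}


print(choose_bits("model.layers.20.mlp.switch_mlp.up_proj"))
```

The router check comes first so that a path ending in `mlp.gate` is never confused with the experts' `gate_proj`.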

Evaluation & honesty

This is not a general "better model" claim. It is a memory/quality tradeoff claim, measured on the same WikiText-2 eval pipeline against the public TurboQuant model.
All perplexities above were measured with the same code on the same ~16k-token WikiText-2 test split, including the real TurboQuant model — apples-to-apples, not cross-paper numbers.
The size win (16.8%) is exact (real stored bytes). The quality difference (−0.007 ppl) is within noise → "smaller at equal quality."
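For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below shows that standard definition on a list of per-token natural-log probabilities. It is not the evaluation pipeline used for the numbers above, whose details (context length, stride, tokenization) are not given here.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities: exp(mean NLL)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))
```

Because the metric is an average over the evaluation tokens, small differences on a ~16k-token split are best read as ties, which is how the −0.007 gap is treated above.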

License

Apache-2.0, inherited from the base model Qwen/Qwen3.6-35B-A3B.


Safetensors

Model size: 35B params
Tensor type: BF16, U32
