How to use inductiveML/Qwen3.6-35B-A3B-evolved-mxbit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("inductiveML/Qwen3.6-35B-A3B-evolved-mxbit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
How to use inductiveML/Qwen3.6-35B-A3B-evolved-mxbit with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
How to use inductiveML/Qwen3.6-35B-A3B-evolved-mxbit with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default inductiveML/Qwen3.6-35B-A3B-evolved-mxbit
Run Hermes
hermes
How to use inductiveML/Qwen3.6-35B-A3B-evolved-mxbit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server (listens on port 8080 by default)
mlx_lm.server --model "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"

# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
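The same chat request can be sketched in Python using only the standard library. This is a minimal illustration, not part of the model's tooling: it assumes the server is reachable at `http://localhost:8080/v1` (the address used in the Pi and Hermes configurations), and the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumption: the server runs locally on port 8080, as in the Pi/Hermes configs.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "inductiveML/Qwen3.6-35B-A3B-evolved-mxbit"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID, max_tokens=128):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Building the request separately from sending it makes the payload easy to inspect without a live server.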
Qwen3.6-35B-A3B — Evolved Mixed-Bit (MLX, 12.63 GB)
TL;DR: 12.63 GB MLX quant of Qwen3.6-35B-A3B. 16.8% smaller than the public TurboQuant-3bit MLX quant, with WikiText-2 perplexity tied to within measurement noise. Loads with stock mlx-lm, no custom loader.
A per-module mixed-precision quantization of Qwen/Qwen3.6-35B-A3B for Apple Silicon / MLX.
Rather than applying one uniform setting, the bit-width and group size of every quantized tensor
were chosen by an evolutionary search (OpenEvolve / MAP-Elites) that minimizes memory under a perplexity budget.
Why this exists
Apple Silicon local-LLM users usually get a bad tradeoff:
- 4-bit works, but costs memory
- 3-bit saves memory, but quality can fall off
- clever custom quant recipes often need custom loaders
This release is the boring-useful point: smaller than the known 3-bit quant, same measured
perplexity, and still just mlx_lm.load(...).
Results
Perplexity on WikiText-2-raw test (~16,260 tokens, identical eval pipeline for every model, including the real public TurboQuant model). Lower is better.
| Model | Size | Perplexity ↓ | vs TurboQuant-3bit |
|---|---|---|---|
| Qwen3.6-35B-A3B-4bit (base) | 19.51 GB | 4.8308 | — |
| Qwen3.6-35B-A3B-TurboQuant-MLX-3bit | 15.18 GB | 5.6743 | — |
| this model (evolved) | 12.63 GB | 5.6673 | −16.8 % size, −0.007 ppl (tied) |
≈ 2.6 effective bits/weight. The Δppl vs TurboQuant (−0.007) is within measurement noise, so read this as "smaller at equal quality," not a quality win.
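The two headline deltas can be reproduced directly from the table. The short check below uses only the sizes and perplexities listed above; the variable names are illustrative.

```python
# Figures copied from the results table.
turboquant_gb, evolved_gb = 15.18, 12.63
turboquant_ppl, evolved_ppl = 5.6743, 5.6673

# Relative size reduction versus the TurboQuant-3bit quant.
size_reduction_pct = (turboquant_gb - evolved_gb) / turboquant_gb * 100

# Absolute perplexity difference (negative means the evolved model is lower).
ppl_delta = evolved_ppl - turboquant_ppl

print(f"size reduction: {size_reduction_pct:.1f}%")   # size reduction: 16.8%
print(f"perplexity delta: {ppl_delta:+.4f}")          # perplexity delta: -0.0070
```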
Usage
uv tool install mlx-lm
mlx_lm.generate --model inductiveML/Qwen3.6-35B-A3B-evolved-mxbit \
--prompt "Write a short Python function to parse a CSV file."
Or from Python:
from mlx_lm import load, generate
model, tokenizer = load("inductiveML/Qwen3.6-35B-A3B-evolved-mxbit")
print(generate(model, tokenizer, prompt="The capital of France is", max_tokens=64))
Quantization recipe
The fused routed experts (mlp.switch_mlp.*) carry ~90% of the weight bytes, so the search
spends its budget there and protects everything small and sensitive:
| Component | Precision |
|---|---|
| Routed experts: down_proj (all layers) | 3-bit |
| Routed experts: gate/up_proj, layers < 9 | 3-bit |
| Routed experts: gate/up_proj, layers ≥ 9 | 2-bit |
| MoE routers (mlp.gate, shared_expert_gate) | 8-bit (base, unchanged) |
| Embeddings, lm_head, attention, shared expert | 4-bit (base, unchanged) |
Deeper experts also use group_size=128 (vs 64) to shave overhead. The exact per-module policy
is in the config.json quantization field (512 entries).
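To get an overview of the per-module policy, the entries in the quantization field can be tallied by (bits, group_size). The sketch below assumes that per-module overrides are stored as dict-valued entries keyed by module path, alongside scalar global defaults; the module paths in the toy example are illustrative, not copied from the real config.json.

```python
from collections import Counter


def summarize_quantization(quantization):
    """Count per-module overrides by (bits, group_size).

    Assumption: dict-valued entries are per-module overrides keyed by module
    path; scalar entries such as "bits" and "group_size" are global defaults.
    """
    counts = Counter()
    for value in quantization.values():
        if isinstance(value, dict):
            counts[(value["bits"], value["group_size"])] += 1
    return counts


# Toy stand-in for the "quantization" field (illustrative paths only).
example = {
    "group_size": 64,
    "bits": 4,
    "model.layers.0.mlp.gate": {"group_size": 64, "bits": 8},
    "model.layers.0.mlp.switch_mlp.down_proj": {"group_size": 64, "bits": 3},
    "model.layers.20.mlp.switch_mlp.up_proj": {"group_size": 128, "bits": 2},
    "model.layers.21.mlp.switch_mlp.up_proj": {"group_size": 128, "bits": 2},
}

for (bits, group_size), n in sorted(summarize_quantization(example).items()):
    print(f"{bits}-bit, group_size={group_size}: {n} module(s)")
```

For the real file, the same function could be applied to `json.load(open("config.json"))["quantization"]` after downloading the repository.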
How it was made
The quantization "recipe" is a function choose_bits(path, info) → {bits, group_size} evolved
with OpenEvolve (MAP-Elites quality-diversity search).
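For illustration, a hand-written function with the same shape as the evolved recipe might look like the sketch below. It is a reconstruction from the recipe table, not the evolved program itself. Several details are assumptions: module paths are assumed to contain `layers.<index>.` and `switch_mlp`, returning `None` is assumed to mean "keep the base setting", the `info` argument is unused here, and the layer at which `group_size=128` begins (`GROUP_128_FROM_LAYER`) is a hypothetical placeholder, since the card only says "deeper experts".

```python
import re

LAYER_RE = re.compile(r"layers\.(\d+)\.")

# Hypothetical threshold: the card does not state where group_size=128 begins.
GROUP_128_FROM_LAYER = 9


def choose_bits(path, info=None):
    """Return {"bits", "group_size"} for a module path, or None to keep the base setting."""
    if "switch_mlp" not in path:
        # Routers, embeddings, lm_head, attention, shared expert: left unchanged.
        return None
    match = LAYER_RE.search(path)
    layer = int(match.group(1)) if match else 0
    group_size = 128 if layer >= GROUP_128_FROM_LAYER else 64
    if "down_proj" in path:
        # Routed-expert down projections stay at 3-bit in every layer.
        return {"bits": 3, "group_size": group_size}
    # Routed-expert gate/up projections: 3-bit below layer 9, 2-bit from layer 9 on.
    return {"bits": 3 if layer < 9 else 2, "group_size": group_size}


print(choose_bits("model.layers.3.mlp.switch_mlp.gate_proj"))
print(choose_bits("model.layers.12.mlp.switch_mlp.up_proj"))
print(choose_bits("model.layers.12.mlp.gate"))
```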
Objective: minimize total bytes subject to perplexity degradation ≤ the public TurboQuant-3bit
level. Candidates are scored on real measured bytes and perplexity, and the frontier was
re-validated on a held-out test corpus the search never saw.
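The constrained objective can be sketched as "pick the smallest candidate whose perplexity stays within the budget". The snippet below is a simplified illustration of that selection rule only; it does not reproduce the MAP-Elites search, and the `Candidate` type and `smallest_within_budget` helper are hypothetical names. The candidate figures are the three rows of the results table.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    size_gb: float
    perplexity: float


def smallest_within_budget(candidates, ppl_budget):
    """Return the smallest candidate with perplexity <= ppl_budget, or None."""
    feasible = [c for c in candidates if c.perplexity <= ppl_budget]
    return min(feasible, key=lambda c: c.size_gb) if feasible else None


# Rows from the results table.
candidates = [
    Candidate("Qwen3.6-35B-A3B-4bit (base)", 19.51, 4.8308),
    Candidate("Qwen3.6-35B-A3B-TurboQuant-MLX-3bit", 15.18, 5.6743),
    Candidate("this model (evolved)", 12.63, 5.6673),
]

# Budget set to the TurboQuant-3bit perplexity, as in the stated objective.
best = smallest_within_budget(candidates, ppl_budget=5.6743)
print(best.name)  # this model (evolved)
```

Tightening the budget changes the answer: with `ppl_budget=5.0`, only the 4-bit base remains feasible.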
Evaluation & honesty
- This is not a general "better model" claim. It is a memory/quality tradeoff claim, measured on the same WikiText-2 eval pipeline against the public TurboQuant model.
- All perplexities above were measured with the same code on the same ~16k-token WikiText-2 test split, including the real TurboQuant model — apples-to-apples, not cross-paper numbers.
- The size win (16.8%) is exact (real stored bytes). The quality difference (−0.007 ppl) is within noise → "smaller at equal quality."
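For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below shows that standard definition on a list of per-token natural-log probabilities; it is not the evaluation pipeline used for the numbers above, and the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over scored tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model that assigns probability 1/4 to every token
# has perplexity 4, regardless of sequence length.
uniform = [math.log(0.25)] * 8
print(round(perplexity(uniform), 6))  # 4.0
```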
License
Apache-2.0, inherited from the base model Qwen/Qwen3.6-35B-A3B.