Agents-A1 — MLX (bf16)

MLX conversion of InternScience/Agents-A1, in bf16. The source checkpoint is already bf16, so this is a lossless format conversion — not a quantization.

Agents-A1 is a Qwen3.5-MoE vision-language agent model (qwen3_5_moe, Qwen3_5MoeForConditionalGeneration): 40 decoder layers, 256 routed experts per layer + a shared expert, hidden size 2048, with a vision tower and video preprocessing.

Running it

Multimodal (VLM) — load with mlx-vlm (mlx-lm can't load multimodal architectures):

pip install mlx-vlm
python -m mlx_vlm.generate --model mlx-community/Agents-A1-bf16 \
  --prompt "What is 17 * 24? Think step by step." --max-tokens 512
# with an image:
python -m mlx_vlm.generate --model mlx-community/Agents-A1-bf16 --image img.jpg --prompt "Describe this image."

The model loads and runs in stock mlx-vlm; no patched code is needed at inference.
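For scripted use, the same CLI call can be wrapped from Python. This is a minimal sketch, not part of mlx-vlm: `build_command` and `run` are hypothetical helper names, and the sketch uses only the flags shown in the commands above (`--model`, `--prompt`, `--max-tokens`, `--image`). It assumes mlx-vlm is installed in the interpreter that runs the script.

```python
import subprocess
import sys
from typing import List, Optional

MODEL = "mlx-community/Agents-A1-bf16"


def build_command(
    prompt: str,
    image: Optional[str] = None,
    max_tokens: int = 512,
    model: str = MODEL,
) -> List[str]:
    """Build the argv list for `python -m mlx_vlm.generate` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "mlx_vlm.generate",
        "--model", model,
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
    ]
    # The image flag is added only for multimodal prompts.
    if image is not None:
        cmd += ["--image", image]
    return cmd


def run(prompt: str, image: Optional[str] = None, max_tokens: int = 512) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        build_command(prompt, image, max_tokens),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run("Describe this image.", image="img.jpg")` would then be equivalent to the image command above. Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.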

Throughput

Measured with oMLX's benchmark harness on a MacBook Pro (M5 Max, 128 GB RAM, 40-core GPU). Each request generates 128 tokens with a cold prefill (unique prompt prefix per request, no cache reuse).

Single request (batch 1) — decode tok/s by context

Context (tokens)	bf16	8-bit	6-bit	5-bit	4-bit	3-bit
1,024	67.6	95.4	95.2	98.2	117.4	133.0
4,096	67.6	94.0	97.3	102.8	119.5	130.4
8,192	66.8	91.7	95.3	103.1	115.7	126.9
16,384	64.7	88.0	91.5	80.5	105.8	119.8
32,768	60.9	80.6	88.6	80.2	95.6	104.2
65,536	53.5	68.4	67.6	66.6	75.4	83.5
131,072	40.7	48.7	50.9	48.2	50.3	52.5
Peak RAM (GB)	66–69	35–39	27–31	23–26	19–22	15–18

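Time to first token (TTFT) with cold prefill is roughly independent of precision: ≈0.3 s at 1k, 3 s at 8k, 21 s at 32k, 63 s at 64k, and ≈225 s at 128k tokens of context. Prefill is compute-bound, not weight-bound, so quantizing the weights barely changes it.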
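The TTFT figures can be turned into an approximate prefill rate and an end-to-end latency estimate. The sketch below is illustrative only: it assumes "1k" through "128k" correspond to the context sizes in the decode table (1,024 through 131,072 tokens), that TTFT is dominated by prefill, and that the rounded TTFT values are accurate enough for a rough estimate. The function names are hypothetical.

```python
# Approximate cold-prefill TTFT in seconds, keyed by context length in tokens.
# Values are the rounded figures quoted in the text; "64k" is taken as 65,536.
TTFT_SECONDS = {
    1024: 0.3,
    8192: 3.0,
    32768: 21.0,
    65536: 63.0,
    131072: 225.0,
}


def prefill_tok_per_s(context_tokens: int, ttft_s: float) -> float:
    """Effective prefill rate, assuming TTFT is all prefill time."""
    return context_tokens / ttft_s


def total_latency_s(ttft_s: float, decode_tok_s: float, new_tokens: int = 128) -> float:
    """Rough end-to-end time: prefill, then new_tokens at the decode rate."""
    return ttft_s + new_tokens / decode_tok_s


for context, ttft in TTFT_SECONDS.items():
    rate = prefill_tok_per_s(context, ttft)
    print(f"{context:>7} tokens: ~{rate:,.0f} prefill tok/s")

# Example: bf16 at 32,768 tokens of context decodes at 60.9 tok/s (from the table).
print(f"bf16 @ 32k, 128 new tokens: ~{total_latency_s(21.0, 60.9):.1f} s")
```

Under these assumptions the effective prefill rate falls as context grows, and at long contexts the total request time is dominated by TTFT rather than by decode speed.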

Continuous batching (1k context) — aggregate decode tok/s

Batch	bf16	8-bit	6-bit	5-bit	4-bit	3-bit
1	67.6	95.4	95.2	98.2	117.4	133.0
2	62.5	151.0	156.5	160.6	190.9	188.7
4	107.1	202.0	185.1	195.7	239.9	230.2
8	129.6	252.4	223.4	238.7	289.0	276.1

Values are aggregated across the batch; the per-request rate is the aggregate value divided by the batch size.
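As a worked example of that division, the sketch below computes per-request decode rates for a few columns of the batching table. The numbers are copied from the table; the function name `per_request_rate` is hypothetical.

```python
# Aggregate decode tok/s at 1k context, copied from the batching table.
AGGREGATE_TOK_S = {
    "bf16": {1: 67.6, 2: 62.5, 4: 107.1, 8: 129.6},
    "8-bit": {1: 95.4, 2: 151.0, 4: 202.0, 8: 252.4},
    "4-bit": {1: 117.4, 2: 190.9, 4: 239.9, 8: 289.0},
}


def per_request_rate(aggregate_tok_s: float, batch_size: int) -> float:
    """Per-request decode rate: aggregate throughput divided by batch size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return aggregate_tok_s / batch_size


for precision, by_batch in AGGREGATE_TOK_S.items():
    for batch_size, aggregate in by_batch.items():
        rate = per_request_rate(aggregate, batch_size)
        print(f"{precision:>5}  batch {batch_size}: {rate:6.1f} tok/s per request")
```

For instance, 4-bit at batch 8 works out to 289.0 / 8 ≈ 36.1 tok/s per request, compared with 117.4 tok/s for a single request: batching raises total throughput while each individual stream decodes more slowly.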

Smoke test

The prompt "What is 17 * 24?" returns the correct answer (408); the output is coherent with no repetition.

Other precisions

Precision	Repo	Size on disk
bf16 (full)	Agents-A1-bf16	~65 GB
8-bit	Agents-A1-8bit	~35 GB
6-bit	Agents-A1-6bit	~27 GB
5-bit	Agents-A1-5bit	~23 GB
4-bit	Agents-A1-4bit	~19 GB
3-bit	Agents-A1-3bit	~15 GB

License

apache-2.0, inherited from the base model.


Safetensors

Model size: 35B params
Tensor type: BF16
