APEX Vision Agentic

Nex-N2-mini

📖 中文文档

Agentic Vision MoE — APEX Quantized GGUF

Thinking Mode Requires Nex's Patched llama.cpp
👉 Using stock llama.cpp (without Nex's patch)? → SC117/Nex-N2-mini-template-fix-APEX-GGUF — works out of the box, no --chat-template-file needed

Nex-N2-mini's original chat template uses complex vision processing macros that stock llama.cpp's Jinja parser cannot handle correctly. This causes thinking tags to not be injected, breaking --reasoning-format.

The official fix: Use Nex's patched llama.cpp, which works with the unmodified GGUF and unmodified template. Once Nex's upstream patch is merged into stock llama.cpp, this workaround will no longer be needed.

⚠️ Do NOT modify chat_template.jinja. The model was trained strictly on the current template — editing the tags deviates from the training-time format and may degrade output quality. See discussion #7.

💡 What is APEX?

These GGUF files are quantized using APEX, a MoE-aware mixed-precision quantization technique that outperforms standard quantization methods while being significantly smaller.

APEX beats Q8_0 perplexity at half the size — and even beats F16.

APEX classifies every tensor by its role — routed expert, shared expert, or attention — and applies a layer-wise precision gradient, giving the most sensitive edge layers higher precision and compressing the redundant middle layers more aggressively.

📦 Available Files
FileSizeBPWNote
Nex-N2-mini.BF16.gguf64.6 GB16.0Full precision reference
Nex-N2-mini-APEX-I-Quality.gguf21.3 GB5.23Highest quality, best accuracy
Nex-N2-mini-APEX-I-Balanced.gguf23.6 GB5.85Best all-rounder, recommended
Nex-N2-mini-APEX-I-Compact.gguf15.4 GB3.81Best quality/size ratio, 16GB VRAM
mmproj-Nex-N2-mini.F16.gguf858 MB-Vision projector (required for image/video)
🧠 Model Details
ArchitectureQwen3.5 MoE (GatedDeltaNet + Full Attention) + Vision Encoder
Parameters35B total, 3B active per token
Experts256 routed experts, 8 active per token
Layers40 layers (30 linear_attn + 10 full_attn)
Context262,144 tokens
VisionImage + Video support (mmproj 858MB)
ThinkingQwen3-style think tags — requires Nex's patched llama.cpp (see above)
🚀 Usage

Download Nex's patched llama.cpp

Binaries: nex-agi/llama.cpp  |  Docker: ghcr.io/nex-agi/llama.cpp:server-cuda-nex-b9596-fix-b9598-8c0d5c9

./llama-server \ -m Nex-N2-mini-APEX-I-Quality.gguf \ -ngl 99 -ncmoe 19 -c 32768 \ --host 0.0.0.0 --port 8081

Replace Nex-N2-mini-APEX-I-Quality.gguf with your preferred quantization tier. Add --mmproj mmproj-Nex-N2-mini.F16.gguf for vision. Recommended sampling: temperature 0.7, top_p 0.95, top_k 40, min_p 0.

📋 Original Model Benchmarks
BenchmarkScoreCategory
BrowseComp74.1Agent
SWE-Bench Verified74.4Coding
Terminal-Bench 2.160.7Coding
GPQA Diamond82.6Reasoning
IFEval89.1Instruction

From the original Nex-N2-mini model card (BF16, full precision).

Using with Stock llama.cpp

If you cannot use Nex's patched llama.cpp, a template-fixed version is available at SC117/Nex-N2-mini-template-fix-APEX-GGUF. These GGUFs have a modified chat_template.jinja embedded so that --reasoning-format works on stock llama.cpp without --chat-template-file.

⚠️ The Nex team explicitly advises against modifying the chat template — the model was trained strictly on the original template, and deviating from the training-time format may degrade output quality. See discussion #7. Use the template-fixed version only if you have no alternative, and be aware of the potential quality trade-off.

Links

Downloads last month
1,328
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SC117/Nex-N2-mini-APEX-GGUF

Quantized
(48)
this model