Instructions to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF",
	filename="Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
llama cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
llama cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
./llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Use Docker

docker model run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

LM Studio
Jan
Ollama
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Ollama:
```
ollama run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
```

Unsloth Studio

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Run Hermes

hermes

Atomic Chat new
Docker Model Runner
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
```

Lemonade

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Run and chat with the model

lemonade run user.Qwen3-Coder-Next-ROCmFP4-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

Qwen3-Coder-Next-ROCmFP4-GGUF / README.md

plunderstruck

Repoint build instructions to charlie12345/ROCmFPX (ROCmFPX FP3/4/6/8 repo)

9acfd2c verified 5 days ago

preview code

Raw

History Blame Contribute Delete

24.4 kB

metadata

base_model: Qwen/Qwen3-Coder-Next
license: apache-2.0
library_name: gguf
tags:
  - gguf
  - rocmfp4
  - qwen3next
  - qwen3-coder-next
  - coder
  - moe
  - imatrix
  - strix-halo
  - amd
  - rocm
  - vulkan
language:
  - en
base_model_relation: quantized

PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3-CODER-NEXT
4-BIT ROCmFP4 · 80B-A3B MoE · CODE-WEIGHTED IMATRIX · AGENTIC CODER · SINGLE AMD APU

    
      FORMAT
ROCmFP4 4-BIT

      PRECISION
~4.5 BPW

      ARCH
QWEN3NEXT

      CONTEXT
262 K

    

      PARAMS
80B · A3B MoE

      DRAFT
NO MTP

      BACKEND
VULKAN0

      LICENSE
APACHE-2.0

    

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/ROCmFPX · branch mtp-rocmfp4-strix.

NOTE // Ignore HuggingFace's auto-detected "F16"/16-bit badge — its parser can't read ROCmFP4 and mislabels the file. These are ~4.5 bpw 4-bit ROCmFP4 files; pick by filename in Files and versions.

Experimental AMD Strix Halo (gfx1151) quant of Qwen3-Coder-Next — Qwen's agentic coding model (80B total / 3B active high-sparsity MoE, hybrid Gated-DeltaNet attention, arch qwen3next, 262K context) — in the custom ROCmFP4 4-bit format, imatrix-quantized with a code-weighted importance matrix.

01 · FILES

File	Output head	Pick if
`…-STRIX-embQ8-imatrix-headQ6.gguf` ★	Q6_K	the one build — best speed/quality balance: Q8 embeddings + Q6 output head on the fast single-scale body

One file — the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt — Q8 token embeddings (matching the Q8 source exactly) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body + a code-weighted imatrix. Not the most faithful possible (see the fidelity link in §04) — it's the point where speed and quality meet best. The DeltaNet-specific tensors (ssm_conv1d, ssm_a, norms, router) stay F32; MoE experts + attention/SSM projections are 4-bit ROCmFP4.

NOTE // Q8 embeddings (not f16): the source is Q8_0, so Q8 matches its precision exactly — f16 would be fake-f16 bloat for zero gain (embeddings are a lookup, not a matmul).

02 · QUICK START

Run from the folder holding the .gguf (the Qwen ChatML template is baked in — just pass --jinja):

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \
  --alias coder-next \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ctk q8_0 \
  -ctv q8_0 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	allow use of the full 128 GB unified memory
`-dev Vulkan0`	run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
`-ngl 999 · -fa on`	offload all layers · flash attention
`-c 262144`	context length (256K)
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch · CPU threads
`-ctk q8_0 · -ctv q8_0`	q8_0 (8-bit) KV cache — how we run it; drop to `q4_0` to use less memory, or raise to `f16`
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing + 64 GB resident reuse cache
`--temp 0.7 --top-p 0.8 --top-k 20`	Qwen-Coder recommended sampling
`--jinja --parallel 1 --metrics --no-mmap`	apply baked ChatML template · single slot · metrics · weights in RAM

NOTE // No --spec-* / --spec-type draft-mtp flags — this arch has no MTP head (see §04). It's already fast on its own.

03 · AGENTIC CODING / TOOLS

Qwen3-Coder-Next is an agentic coder — built to call tools, not narrate code. To wire it up:

Chat template: Qwen (ChatML) is baked into the GGUF — just pass --jinja and your client applies it automatically.
Tool calling: enable the qwen3_coder tool-call parser in your client (e.g. the matching parser flag in llama-server / your agent harness). Without it, native tool calls won't be parsed and the model tends to narrate code instead of calling tools.
Sampling: temp 0.7, top-p 0.8, top-k 20 (Qwen-Coder recommended) — already set in §02.

NOTE // The cross-turn reuse cache (--cache-reuse / --cache-ram) keeps long agentic sessions cheap — the leading prompt isn't re-prefilled every turn.

04 · PERFORMANCE & QUALITY

DECODE · short context	~54 t/s (Vulkan / Ryzen AI Max+ 395)
SPECULATIVE DECODE	none (no MTP head)
LONG CONTEXT	cheap — DeltaNet near-constant memory
QUANTIZATION	fast single-scale body + Q8 emb + Q6 head + code-weighted imatrix (measured win — below)

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. On top of the imatrix + Q8 emb + Q6 head, we swept the body kernel against the Q8 source by KL divergence (the right fidelity metric). An all-dual-scale body did edge the fast single-scale body on KL, but the gain sat inside the measurement noise while costing decode speed — so the fast single-scale body + Q8 embeddings + Q6 head is the right point, and the one file we ship.

This mirrors the fuller sweep on our Qwen3.6-27B sibling, where every higher-precision body lever (all-dual-scale, selective Q5/Q6 bumps) bought a KL improvement inside the noise at a real speed cost — and where copying an entire dynamic-quant high-precision allocation onto ROCmFP4 still couldn't match a true dynamic K-quant, because FP4 is intrinsically less faithful than Q4_K's 4-bit. The same format limit applies here: within ROCmFP4, fast body + Q8 emb + Q6 head is the optimal balance; for maximum fidelity reach for a dynamic K-quant of the base (box below). (Directional internal measurements — KL vs Q8 on held-out code; reproduce before citing.)

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Grab a Q6_K / Q8 dynamic GGUF of the base from Qwen/Qwen3-Coder-Next — higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab.

Fast even without speculative decoding. 3B active params + linear Gated-DeltaNet attention → ~54 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0), and cheap long context. No MTP needed.

NOTE // NO MTP Qwen3-Coder-Next ships without an MTP head, and the ROCmFP4 fork currently wires MTP drafting only for the qwen35/qwen35moe archs, not qwen3next. So these are no-MTP (non-speculative) builds — in practice it doesn't matter, it's fast on its own.

The imatrix — code-weighted, and measured (a clean win here). Quantized with an importance matrix built from a code-weighted calibration mix (~2.6:1 code:general): real multi-language source + code-analysis prompts from eaddario/imatrix-calibration, plus Kalomaze's groups_merged (via froggeric/imatrix) for general.

KL-divergence + perplexity vs the Q8 reference on a held-out code slice (disjoint from calibration), imatrix vs no-imatrix:

Metric (vs Q8, held-out code)	No-imatrix	Imatrix	Change
Median KLD	0.00597	0.00478	−20%
90th-pct KLD	0.1342	0.1083	−19%
RMS Δp	8.14%	7.36%	−10%
Same top token as Q8	91.01%	91.49%	+0.48 pp
Mean PPL	3.4556	3.4686	+0.013 (within ±0.077 noise — a wash)

So the imatrix measurably improves quantization fidelity to the full model on code (median KL −20%, the gold-standard metric), at zero cost (same size/speed). PPL is a statistical wash. Honest scope: this is a fidelity-vs-Q8 measurement on ~20 K tokens of held-out code, not an absolute coding benchmark.

NOTE // On "dual imatrix": a plain merge of two imatrices is mathematically identical to concatenating the corpora at the same ratio — the only real lever is the code:general ratio, which is what's set here. True size-decoupled balancing would need normalized-merge tooling; not used.

05 · BUILD (REPRODUCIBLE)

# code-weighted imatrix on the Q8 (single pass; ratio = the real lever)
llama-imatrix -m Qwen3-Coder-Next-Q8_0.gguf -f code-weighted-calib.txt -o coder-next.imatrix -c 512 -ngl 999

# quant -> ROCmFP4 with the imatrix (Q8 embeddings) + Q6 output head — the ★ file (§01)
# fast single-scale body; --output-tensor-type q6_K raises the output head to Q6_K
llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix coder-next.imatrix \
  Qwen3-Coder-Next-Q8_0.gguf  Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 · LINEAGE & CREDITS

BASE MODEL	Qwen/Qwen3-Coder-Next (Apache-2.0, Qwen team) · 80B-A3B MoE, arch `qwen3next`
CALIBRATION	eaddario/imatrix-calibration (code) · Kalomaze `groups_merged` via froggeric/imatrix (general)
FORMAT + RUNTIME	charlie12345/ROCmFPX (based on llama.cpp, MIT)

Derivative quantization — verify the base model's license before redistribution / use.

FORMAT ROCmFP4 4-BIT	PRECISION ~4.5 BPW	ARCH QWEN3NEXT	CONTEXT 262 K
PARAMS 80B · A3B MoE	DRAFT NO MTP	BACKEND VULKAN0	LICENSE APACHE-2.0