plunderstruck's picture
Repoint build instructions to charlie12345/ROCmFPX (ROCmFPX FP3/4/6/8 repo)
9acfd2c verified
|
Raw
History Blame Contribute Delete
24.4 kB
metadata
base_model: Qwen/Qwen3-Coder-Next
license: apache-2.0
library_name: gguf
tags:
  - gguf
  - rocmfp4
  - qwen3next
  - qwen3-coder-next
  - coder
  - moe
  - imatrix
  - strix-halo
  - amd
  - rocm
  - vulkan
language:
  - en
base_model_relation: quantized
PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO Β· gfx1151
            β–—β–‡β–‡β–‡β–‡β–‡β–‡β–‡β––                 
           β–—β–ˆβ–˜β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ––                
          β–—β–›   β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–…     
         β–Ÿβ–›    β–—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–™β––   
   β–„β–„β–„β–„β–„β–Ÿβ–›    β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ––  
 β–—β–ˆβ–ˆβ–Œ    β–šβ––   β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–ˆβ–˜  
β–—β–ˆβ–ˆβ–ˆβ–ˆβ––    β–œβ––                    β–—β–ˆβ–˜   
β–œβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–™    β–œβ–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–€β–€β–€β–€β–€β–œβ–™    
 β–œβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–™    β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–›       β–œβ–™   
  β–œβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–™    β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–›    β–ƒ    β–œβ–™  
   β–€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–™β––   β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–˜    β–Ÿβ–ˆβ–™    β–€β–™ 
    β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ––   β–β–œβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–˜    β–Ÿβ–ˆβ–ˆβ–ˆβ–™β–‚β–‚β–‚β–‚β–β–ˆ
    β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ––    β–œβ–ˆβ–ˆβ–ˆβ–˜   β–—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–›
   β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–„    β–œβ–›    β–—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–€ 
  β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–€        β–—β–›    β–—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–€β–€β–€β–€β–€β–˜  
    β–œβ–ˆβ–ˆβ–˜        β–—β–›    β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–›β–˜        
     β–œβ–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–ˆβ––   β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–›          
                β–β–ˆβ–– β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–›           
                 β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–€            
QWEN3-CODER-NEXT
4-BIT ROCmFP4 Β· 80B-A3B MoE Β· CODE-WEIGHTED IMATRIX Β· AGENTIC CODER Β· SINGLE AMD APU
FORMAT
ROCmFP4 4-BIT
PRECISION
~4.5 BPW
ARCH
QWEN3NEXT
CONTEXT
262 K
PARAMS
80B Β· A3B MoE
DRAFT
NO MTP
BACKEND
VULKAN0
LICENSE
APACHE-2.0
⚠ REQUIRES THE ROCmFP4 FORK
The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/ROCmFPX Β· branch mtp-rocmfp4-strix.
NOTE // Ignore HuggingFace's auto-detected "F16"/16-bit badge β€” its parser can't read ROCmFP4 and mislabels the file. These are ~4.5 bpw 4-bit ROCmFP4 files; pick by filename in Files and versions.

Experimental AMD Strix Halo (gfx1151) quant of Qwen3-Coder-Next β€” Qwen's agentic coding model (80B total / 3B active high-sparsity MoE, hybrid Gated-DeltaNet attention, arch qwen3next, 262K context) β€” in the custom ROCmFP4 4-bit format, imatrix-quantized with a code-weighted importance matrix.

01 Β· FILES
File Output head Pick if
…-STRIX-embQ8-imatrix-headQ6.gguf β˜…Q6_Kthe one build β€” best speed/quality balance: Q8 embeddings + Q6 output head on the fast single-scale body

One file β€” the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt β€” Q8 token embeddings (matching the Q8 source exactly) and a Q6_K output head β€” on the fast single-scale q4_0_rocmfp4_fast body + a code-weighted imatrix. Not the most faithful possible (see the fidelity link in Β§04) β€” it's the point where speed and quality meet best. The DeltaNet-specific tensors (ssm_conv1d, ssm_a, norms, router) stay F32; MoE experts + attention/SSM projections are 4-bit ROCmFP4.

NOTE // Q8 embeddings (not f16): the source is Q8_0, so Q8 matches its precision exactly β€” f16 would be fake-f16 bloat for zero gain (embeddings are a lookup, not a matmul).
02 Β· QUICK START

Run from the folder holding the .gguf (the Qwen ChatML template is baked in β€” just pass --jinja):

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \
  --alias coder-next \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ctk q8_0 \
  -ctv q8_0 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap
Flag Function
HSA_OVERRIDE_GFX_VERSION=11.5.1treat the APU as gfx1151 (Strix Halo)
GGML_HIP_ENABLE_UNIFIED_MEMORY=1allow use of the full 128 GB unified memory
-dev Vulkan0run on Vulkan β€” fastest backend for ROCmFP4 on Strix Halo
-ngl 999 Β· -fa onoffload all layers Β· flash attention
-c 262144context length (256K)
-b 2048 Β· -ub 256 Β· -t/-tb 16prefill batch / micro-batch Β· CPU threads
-ctk q8_0 Β· -ctv q8_0q8_0 (8-bit) KV cache β€” how we run it; drop to q4_0 to use less memory, or raise to f16
-cpent Β· -ctxcp Β· --cache-reuse Β· --cache-ram 65536cross-turn KV checkpointing + 64 GB resident reuse cache
--temp 0.7 --top-p 0.8 --top-k 20Qwen-Coder recommended sampling
--jinja --parallel 1 --metrics --no-mmapapply baked ChatML template Β· single slot Β· metrics Β· weights in RAM
NOTE // No --spec-* / --spec-type draft-mtp flags β€” this arch has no MTP head (see Β§04). It's already fast on its own.
03 Β· AGENTIC CODING / TOOLS

Qwen3-Coder-Next is an agentic coder β€” built to call tools, not narrate code. To wire it up:

  • Chat template: Qwen (ChatML) is baked into the GGUF β€” just pass --jinja and your client applies it automatically.
  • Tool calling: enable the qwen3_coder tool-call parser in your client (e.g. the matching parser flag in llama-server / your agent harness). Without it, native tool calls won't be parsed and the model tends to narrate code instead of calling tools.
  • Sampling: temp 0.7, top-p 0.8, top-k 20 (Qwen-Coder recommended) β€” already set in Β§02.
NOTE // The cross-turn reuse cache (--cache-reuse / --cache-ram) keeps long agentic sessions cheap β€” the leading prompt isn't re-prefilled every turn.
04 Β· PERFORMANCE & QUALITY
DECODE Β· short context~54 t/s (Vulkan / Ryzen AI Max+ 395)
SPECULATIVE DECODEnone (no MTP head)
LONG CONTEXTcheap β€” DeltaNet near-constant memory
QUANTIZATIONfast single-scale body + Q8 emb + Q6 head + code-weighted imatrix (measured win β€” below)

This is the best speed/quality balance in ROCmFP4 β€” by design, not the absolute fastest. On top of the imatrix + Q8 emb + Q6 head, we swept the body kernel against the Q8 source by KL divergence (the right fidelity metric). An all-dual-scale body did edge the fast single-scale body on KL, but the gain sat inside the measurement noise while costing decode speed β€” so the fast single-scale body + Q8 embeddings + Q6 head is the right point, and the one file we ship.

This mirrors the fuller sweep on our Qwen3.6-27B sibling, where every higher-precision body lever (all-dual-scale, selective Q5/Q6 bumps) bought a KL improvement inside the noise at a real speed cost β€” and where copying an entire dynamic-quant high-precision allocation onto ROCmFP4 still couldn't match a true dynamic K-quant, because FP4 is intrinsically less faithful than Q4_K's 4-bit. The same format limit applies here: within ROCmFP4, fast body + Q8 emb + Q6 head is the optimal balance; for maximum fidelity reach for a dynamic K-quant of the base (box below). (Directional internal measurements β€” KL vs Q8 on held-out code; reproduce before citing.)

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Grab a Q6_K / Q8 dynamic GGUF of the base from Qwen/Qwen3-Coder-Next β€” higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab.

Fast even without speculative decoding. 3B active params + linear Gated-DeltaNet attention β†’ ~54 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0), and cheap long context. No MTP needed.

NOTE // NO MTP Qwen3-Coder-Next ships without an MTP head, and the ROCmFP4 fork currently wires MTP drafting only for the qwen35/qwen35moe archs, not qwen3next. So these are no-MTP (non-speculative) builds β€” in practice it doesn't matter, it's fast on its own.

The imatrix β€” code-weighted, and measured (a clean win here). Quantized with an importance matrix built from a code-weighted calibration mix (~2.6:1 code:general): real multi-language source + code-analysis prompts from eaddario/imatrix-calibration, plus Kalomaze's groups_merged (via froggeric/imatrix) for general.

KL-divergence + perplexity vs the Q8 reference on a held-out code slice (disjoint from calibration), imatrix vs no-imatrix:

Metric (vs Q8, held-out code) No-imatrix Imatrix Change
Median KLD0.005970.00478βˆ’20%
90th-pct KLD0.13420.1083βˆ’19%
RMS Ξ”p8.14%7.36%βˆ’10%
Same top token as Q891.01%91.49%+0.48 pp
Mean PPL3.45563.4686+0.013 (within Β±0.077 noise β€” a wash)

So the imatrix measurably improves quantization fidelity to the full model on code (median KL βˆ’20%, the gold-standard metric), at zero cost (same size/speed). PPL is a statistical wash. Honest scope: this is a fidelity-vs-Q8 measurement on ~20 K tokens of held-out code, not an absolute coding benchmark.

NOTE // On "dual imatrix": a plain merge of two imatrices is mathematically identical to concatenating the corpora at the same ratio β€” the only real lever is the code:general ratio, which is what's set here. True size-decoupled balancing would need normalized-merge tooling; not used.
05 Β· BUILD (REPRODUCIBLE)
# code-weighted imatrix on the Q8 (single pass; ratio = the real lever)
llama-imatrix -m Qwen3-Coder-Next-Q8_0.gguf -f code-weighted-calib.txt -o coder-next.imatrix -c 512 -ngl 999

# quant -> ROCmFP4 with the imatrix (Q8 embeddings) + Q6 output head β€” the β˜… file (Β§01)
# fast single-scale body; --output-tensor-type q6_K raises the output head to Q6_K
llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix coder-next.imatrix \
  Qwen3-Coder-Next-Q8_0.gguf  Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf  Q4_0_ROCMFP4_STRIX

Experimental research build for AMD Strix Halo β€” hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

06 Β· LINEAGE & CREDITS
BASE MODELQwen/Qwen3-Coder-Next (Apache-2.0, Qwen team) Β· 80B-A3B MoE, arch qwen3next
CALIBRATIONeaddario/imatrix-calibration (code) Β· Kalomaze groups_merged via froggeric/imatrix (general)
FORMAT + RUNTIMEcharlie12345/ROCmFPX (based on llama.cpp, MIT)

Derivative quantization β€” verify the base model's license before redistribution / use.