mxfp4_16 Quantization of MiniMaxAI/MiniMax-M2.7

Runtime: requires tcclaviger/vllm22:latest, an RDNA 4 (gfx12xx) vLLM image with mxfp4_16 kernel support. No other vLLM build currently loads these weights.


1. Introduction

This is an MXFP4-16 (Mixed-precision 4-bit with 16-element group size) quantized variant of MiniMaxAI/MiniMax-M2.7, produced using compressed-tensors with an IQ4_NL codebook.

The quantization:

  • 4-bit weights with 16-element group size, IQ4_NL codebook
  • All Linear layers quantized (MoE experts, FFN, attention projections)
  • Attention k/v_proj scales, router gate, norms, embeddings kept BF16
  • KV cache: FP8 (e4m3), calibrated scales baked into checkpoint

The quantized weights occupy ~17.5 GiB per GPU at tensor-parallel size 8 (TP8) while retaining near-BF16 quality.
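To make the group-wise scheme concrete, here is a minimal pure-Python sketch of codebook quantization with 16-element groups. It is not the compressed-tensors implementation. The codebook values are the IQ4_NL table published in llama.cpp (whether this checkpoint uses exactly those values is an assumption), the scale rule is a simple one-shot choice rather than an error-minimizing search, and the function names are hypothetical.

```python
# IQ4_NL codebook as published in llama.cpp; assumed, not confirmed, for this checkpoint.
IQ4_NL = [-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113]
GROUP_SIZE = 16  # number of weights that share one scale


def quantize_group(weights):
    """Quantize one group to (scale, list of 4-bit codebook indices)."""
    if len(weights) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} weights, got {len(weights)}")
    # Signed value of the largest-magnitude element in the group.
    peak = max(weights, key=abs)
    if peak == 0.0:
        return 0.0, [0] * GROUP_SIZE  # all-zero group: every index decodes to 0
    # Map the peak onto the -127 entry so it is reproduced exactly.
    # The scale is therefore negative whenever the peak is positive.
    scale = peak / IQ4_NL[0]
    indices = [
        min(range(len(IQ4_NL)), key=lambda i: abs(w / scale - IQ4_NL[i]))
        for w in weights
    ]
    return scale, indices


def dequantize_group(scale, indices):
    """Reconstruct approximate weights from a scale and codebook indices."""
    return [scale * IQ4_NL[i] for i in indices]


group = [0.5, -0.25, 0.1, 0.0] * 4
scale, indices = quantize_group(group)
restored = dequantize_group(scale, indices)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

Under this layout each group stores sixteen 4-bit indices plus one scale, so the cost per weight is 4 bits plus one sixteenth of the scale width. The codebook is denser near zero, which is why small weights are reconstructed more accurately than mid-range ones.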


2. Model Architecture

  • 229B total params (BF16), ~12B activated per token (top-8)
  • 256 experts per MoE layer, top-8 routing, 62 transformer layers
  • 200k context window
  • Native tool-calling support

3. Runtime Requirements

  • GPU: 8× RX 9700 (RDNA 4 / gfx12xx)
  • Memory: 128GB+ system RAM
  • Docker: tcclaviger/vllm22:latest — only validated runtime

The Docker image includes:

  • Custom Triton attention kernels tuned for RDNA4
  • Fixed FP8 KV-cache quantization path
  • Pre-tuned GEMM configs for RX 9700
  • MXFP4-16 kernels for gfx12xx

4. Deployment

Full deployment guide (RDNA4 / RX 9700): docs/vllm_deploy_guide.md

Quick-start:

docker run --name minimax-mxfp416 \
  --rm --tty --ipc=host --shm-size=128g \
  --device /dev/kfd:/dev/kfd \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  --device /dev/dri/renderD129:/dev/dri/renderD129 \
  --device /dev/dri/renderD130:/dev/dri/renderD130 \
  --device /dev/dri/renderD132:/dev/dri/renderD132 \
  --device /dev/dri/renderD137:/dev/dri/renderD137 \
  --device /dev/dri/renderD138:/dev/dri/renderD138 \
  --device /dev/dri/renderD139:/dev/dri/renderD139 \
  --device /dev/dri/renderD140:/dev/dri/renderD140 \
  -e HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e TRUST_REMOTE_CODE=1 \
  -v /path/to/models:/app/models:ro \
  -p 8000:8000 \
  tcclaviger/vllm22:latest \
  bash -c "cp /app/models/vllm22_minimax_m2.py /app/vllm/vllm/model_executor/models/minimax_m2.py && \
    pip install -q sentencepiece && \
    exec vllm serve /app/models/MiniMax-M2.7-MXFP416 \
      --served-model-name minimax-m2.7-mxfp416 \
      --host 0.0.0.0 --port 8000 --trust-remote-code \
      --tensor-parallel-size 8 --enable-expert-parallel \
      --disable-cascade-attn \
      --reasoning-parser minimax_m2 \
      --enable-auto-tool-choice --tool-call-parser minimax_m2 \
      --enable-prefix-caching --gpu-memory-utilization 0.93 \
      --max-model-len 180000 --max-num-seqs 48 --max-num-batched-tokens 2048 \
      --kv-cache-dtype fp8_e4m3 --attention-backend TRITON_ATTN \
      --override-generation-config '{\"max_tokens\": 16384}'"
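Loading the weights across eight GPUs takes a while, so it can help to wait for the server before sending traffic. The sketch below polls the OpenAI-compatible /v1/models endpoint on the port published by the command above, using only the standard library. The function name, the 30-minute default timeout, and the 10-second poll interval are assumptions for illustration.

```python
import json
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://localhost:8000", timeout_s=1800.0, poll_s=10.0):
    """Poll GET {base_url}/v1/models until the server answers or time runs out.

    Returns the list of served model ids, or None if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                payload = json.load(resp)
                return [model["id"] for model in payload.get("data", [])]
        except (urllib.error.URLError, OSError, ValueError):
            # Server is still loading weights or not listening yet.
            time.sleep(poll_s)
    return None


# Example: models = wait_until_ready()
# With the quick-start flags, the returned list should contain the name
# passed to --served-model-name.
```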

Performance (8× RX 9700, 210W power limit)

| Metric | Value |
|---|---|
| Generation throughput | ~30–35 tokens/s |
| Prefill throughput | up to 2,190 tokens/s (with prefix cache) |
| Prefix cache hit rate | ~93% |
| KV cache memory | 11.35 GiB |
| KV cache capacity | 767,856 tokens |
| Max context per request | 180,000 tokens |
| Max concurrent (180k) | 4 requests |
| Model weight memory (TP8) | ~17.5 GiB/GPU |
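The "Max concurrent (180k)" figure follows from the two KV-cache rows above it. A short worked check, which ignores block granularity and any sharing from the prefix cache (the function name is hypothetical):

```python
def full_context_slots(kv_capacity_tokens, max_model_len):
    """How many requests can each hold a full-length context in the KV cache."""
    return kv_capacity_tokens // max_model_len


KV_CAPACITY_TOKENS = 767_856  # "KV cache capacity" from the table
MAX_MODEL_LEN = 180_000       # "Max context per request" from the table

slots = full_context_slots(KV_CAPACITY_TOKENS, MAX_MODEL_LEN)
leftover = KV_CAPACITY_TOKENS - slots * MAX_MODEL_LEN
print(slots, leftover)  # 4 47856
```

Requests with shorter contexts pack more densely, which is why the quick-start command can still set --max-num-seqs 48.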

Power tip: set the power limit to 210 W on each GPU with rocm-smi --setpowerlimit <i> 210. At 210 W, sustained throughput is higher than at the full 300 W because thermal throttling is reduced.


5. API Usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="minimax-m2.7-mxfp416",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ],
    temperature=1.0,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
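Because the quick-start command enables --enable-auto-tool-choice with the minimax_m2 tool-call parser, tool calls can be requested through the standard tools field of the chat completions API. The following is a sketch under that assumption. The get_weather tool and both function names are hypothetical and exist only for illustration.

```python
def build_tool_request(question):
    """Keyword arguments for client.chat.completions.create with one tool."""
    # get_weather is a hypothetical tool used only for illustration.
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "minimax-m2.7-mxfp416",
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",
        "max_tokens": 1024,
    }


def ask_with_tools(question, base_url="http://localhost:8000/v1"):
    """Send the request and return (name, arguments) pairs for any tool calls."""
    from openai import OpenAI  # imported lazily so the builder works without it

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    completion = client.chat.completions.create(**build_tool_request(question))
    calls = completion.choices[0].message.tool_calls or []
    return [(call.function.name, call.function.arguments) for call in calls]
```

Each arguments value comes back as a JSON string, so parse it with json.loads before dispatching to a real function. An empty list means the model answered directly without calling a tool.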

6. Chat Template

The model uses a Jinja chat template supporting system messages, tool calls (<minimax:tool_call>/</minimax:tool_call>), reasoning content (<think>/</think>), and tool responses (<response>).

from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416",
    device_map="auto", dtype="auto", trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

7. Inference Parameters

  • temperature: 1.0
  • top_p: 0.95
  • top_k: 40
  • max_tokens: 16384 (default)
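When calling the vLLM server through the OpenAI client, temperature, top_p, and max_tokens are ordinary arguments, while top_k is not part of the OpenAI schema and is usually passed through extra_body. A small sketch that bundles the values listed above (the helper name is hypothetical):

```python
def recommended_sampling(max_tokens=16384):
    """Sampling settings from this section as chat.completions.create kwargs."""
    return {
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": max_tokens,
        # top_k is a vLLM extension, so it travels in extra_body.
        "extra_body": {"top_k": 40},
    }


# Example, with a client configured as in the API usage section:
# completion = client.chat.completions.create(
#     model="minimax-m2.7-mxfp416",
#     messages=[{"role": "user", "content": "Hello!"}],
#     **recommended_sampling(max_tokens=1024),
# )
```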

8. Acknowledgments


9. License

Apache 2.0 — inherits from base model.
