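Instructions to use djdeniro/MiniMax-M2.7-MXFP416 with libraries and local apps. Note that the model card below lists tcclaviger/vllm22:latest as the only validated runtime, so the generic snippets in this section may need that image to load the weights.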

Libraries

How to use djdeniro/MiniMax-M2.7-MXFP416 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="djdeniro/MiniMax-M2.7-MXFP416", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("djdeniro/MiniMax-M2.7-MXFP416", trust_remote_code=True)
# MiniMax-M2.7 is a text-only model, so the causal-LM auto class is used here.
model = AutoModelForCausalLM.from_pretrained("djdeniro/MiniMax-M2.7-MXFP416", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use djdeniro/MiniMax-M2.7-MXFP416 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "djdeniro/MiniMax-M2.7-MXFP416"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "djdeniro/MiniMax-M2.7-MXFP416",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/djdeniro/MiniMax-M2.7-MXFP416

SGLang

How to use djdeniro/MiniMax-M2.7-MXFP416 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "djdeniro/MiniMax-M2.7-MXFP416" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "djdeniro/MiniMax-M2.7-MXFP416",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "djdeniro/MiniMax-M2.7-MXFP416" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "djdeniro/MiniMax-M2.7-MXFP416",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use djdeniro/MiniMax-M2.7-MXFP416 with Docker Model Runner:
```
docker model run hf.co/djdeniro/MiniMax-M2.7-MXFP416
```

`mxfp4_16` Quantization of MiniMaxAI/MiniMax-M2.7

Runtime: Requires tcclaviger/vllm22:latest, an RDNA 4 (gfx12xx) vLLM image with mxfp4_16 kernel support. No other vLLM build currently loads these weights.

1. Introduction

This is an MXFP4-16 (microscaling 4-bit format with a 16-element group size) quantized variant of MiniMaxAI/MiniMax-M2.7, produced using compressed-tensors with an IQ4_NL codebook.

The quantization:

4-bit weights with 16-element group size, IQ4_NL codebook
All Linear layers quantized (MoE experts, FFN, attention projections)
Attention k/v_proj scales, router gate, norms, embeddings kept BF16
KV cache: FP8 (e4m3), calibrated scales baked into checkpoint

The quantized weights occupy ~17.5 GiB per GPU at tensor-parallel size 8 (TP8) while retaining near-BF16 quality.
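To make the scheme above concrete, here is a minimal sketch of 4-bit codebook quantization over 16-element groups. It is not the kernel shipped in the runtime image. The codebook is the IQ4_NL table as found in llama.cpp (worth verifying against the checkpoint's own codebook), and the scale rule, the float scale type, and all function names are assumptions made for illustration.

```python
# Illustrative only: group-wise 4-bit codebook quantization.
# Assumptions: one float scale per group, chosen from the largest magnitude.
IQ4_NL = [-127, -104, -83, -65, -49, -35, -22, -10,
          1, 13, 25, 38, 53, 69, 89, 113]
GROUP_SIZE = 16


def quantize_group(values):
    """Return (scale, indices) for one group of GROUP_SIZE floats."""
    assert len(values) == GROUP_SIZE
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        # An all-zero group: a zero scale reconstructs exact zeros.
        return 0.0, [8] * GROUP_SIZE
    # Map the largest magnitude onto the largest-magnitude codebook entry.
    scale = max_abs / 127.0
    indices = []
    for v in values:
        target = v / scale
        # Nearest codebook entry; each index fits in 4 bits (0..15).
        indices.append(min(range(16), key=lambda i: abs(IQ4_NL[i] - target)))
    return scale, indices


def dequantize_group(scale, indices):
    """Reconstruct approximate floats from a scale and 4-bit indices."""
    return [scale * IQ4_NL[i] for i in indices]


def quantize(weights):
    """Quantize a flat list whose length is a multiple of GROUP_SIZE."""
    assert len(weights) % GROUP_SIZE == 0
    return [quantize_group(weights[i:i + GROUP_SIZE])
            for i in range(0, len(weights), GROUP_SIZE)]
```

Because the codebook entries are spaced unevenly (denser near zero), small weights are represented more finely than large ones, which is the usual motivation for a non-linear 4-bit table over a uniform grid.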

2. Model Architecture

229B total params (BF16), ~12B activated per token (top-8)
256 experts per MoE layer, top-8 routing, 62 transformer layers
200k context window
Native tool-calling support

3. Runtime Requirements

GPU: 8× RX 9700 (RDNA 4 / gfx12xx)
Memory: 128GB+ system RAM
Docker: tcclaviger/vllm22:latest — only validated runtime

The Docker image includes:

Custom Triton attention kernels tuned for RDNA4
Fixed FP8 KV-cache quantization path
Pre-tuned GEMM configs for RX 9700
MXFP4-16 kernels for gfx12xx

4. Deployment

Full deployment guide (RDNA4 / RX 9700): docs/vllm_deploy_guide.md

Quick-start:

docker run --name minimax-mxfp416 \
  --rm --tty --ipc=host --shm-size=128g \
  --device /dev/kfd:/dev/kfd \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  --device /dev/dri/renderD129:/dev/dri/renderD129 \
  --device /dev/dri/renderD130:/dev/dri/renderD130 \
  --device /dev/dri/renderD132:/dev/dri/renderD132 \
  --device /dev/dri/renderD137:/dev/dri/renderD137 \
  --device /dev/dri/renderD138:/dev/dri/renderD138 \
  --device /dev/dri/renderD139:/dev/dri/renderD139 \
  --device /dev/dri/renderD140:/dev/dri/renderD140 \
  -e HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e TRUST_REMOTE_CODE=1 \
  -v /path/to/models:/app/models:ro \
  -p 8000:8000 \
  tcclaviger/vllm22:latest \
  bash -c "cp /app/models/vllm22_minimax_m2.py /app/vllm/vllm/model_executor/models/minimax_m2.py && \
    pip install -q sentencepiece && \
    exec vllm serve /app/models/MiniMax-M2.7-MXFP416 \
      --served-model-name minimax-m2.7-mxfp416 \
      --host 0.0.0.0 --port 8000 --trust-remote-code \
      --tensor-parallel-size 8 --enable-expert-parallel \
      --disable-cascade-attn \
      --reasoning-parser minimax_m2 \
      --enable-auto-tool-choice --tool-call-parser minimax_m2 \
      --enable-prefix-caching --gpu-memory-utilization 0.93 \
      --max-model-len 180000 --max-num-seqs 48 --max-num-batched-tokens 2048 \
      --kv-cache-dtype fp8_e4m3 --attention-backend TRITON_ATTN \
      --override-generation-config '{\"max_tokens\": 16384}'"
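Loading the weights across eight GPUs can take a while, so it may help to wait for the server before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` route on the port published in the quick-start. The function name, the default timeout, and the polling interval are illustrative assumptions, not values from the deployment guide.

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_server(base_url="http://localhost:8000", timeout_s=1800.0, interval_s=10.0):
    """Poll GET {base_url}/v1/models until it answers or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                served = [m["id"] for m in json.load(resp)["data"]]
                print("server ready, models:", served)
                return True
        except (urllib.error.URLError, OSError, ValueError, KeyError):
            # Not reachable yet (or not a valid model list): retry after a pause.
            time.sleep(interval_s)
    return False


if __name__ == "__main__":
    if not wait_for_server():
        raise SystemExit("server did not become ready in time")
```

With the quick-start flags, the reported model id should be the value passed to `--served-model-name`.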

Performance (8× RX 9700, 210W power limit)

| Metric | Value |
| --- | --- |
| Generation throughput | ~30–35 tokens/s |
| Prefill throughput | up to 2,190 tokens/s (w/ prefix cache) |
| Prefix cache hit rate | ~93% |
| KV cache memory | 11.35 GiB |
| KV cache capacity | 767,856 tokens |
| Max context per request | 180,000 tokens |
| Max concurrent (180k) | 4 requests |
| Model weight memory (TP8) | ~17.5 GiB/GPU |
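The "Max concurrent (180k)" figure follows from the KV cache capacity and the maximum context length by integer division. A small sketch of that arithmetic, under the simplifying assumptions that every request reserves its full context length and that no KV blocks are shared through prefix caching (the helper name is hypothetical):

```python
KV_CACHE_CAPACITY_TOKENS = 767_856  # "KV cache capacity" from the table
MAX_MODEL_LEN = 180_000             # --max-model-len in the quick-start


def max_full_length_requests(capacity_tokens, tokens_per_request):
    """How many requests of a given length fit in the KV cache at once."""
    return capacity_tokens // tokens_per_request


print(max_full_length_requests(KV_CACHE_CAPACITY_TOKENS, MAX_MODEL_LEN))  # → 4
```

Shorter requests fit in proportionally larger numbers, up to the `--max-num-seqs 48` limit set in the quick-start.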

Power tip: Run rocm-smi --setpowerlimit <i> 210 for each GPU index <i>. At 210 W, sustained throughput is higher than at the full 300 W because thermal throttling is reduced.

5. API Usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="minimax-m2.7-mxfp416",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ],
    temperature=1.0,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
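Since the quick-start launches the server with `--enable-auto-tool-choice` and `--tool-call-parser minimax_m2`, tool calls should be reachable through the standard OpenAI `tools` argument. The following is a sketch under that assumption. The `get_weather` tool, its stub implementation, and the helper names are hypothetical and exist only to show the request and response flow.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def get_weather(city):
    # Stub implementation; a real tool would query a weather service.
    return {"city": city, "forecast": "sunny"}


def run_tool(name, arguments_json):
    """Dispatch one tool call returned by the model to local Python code."""
    handlers = {"get_weather": get_weather}
    return json.dumps(handlers[name](**json.loads(arguments_json)))


def ask_with_tools(question, base_url="http://localhost:8000/v1"):
    from openai import OpenAI  # imported lazily so the helpers above run without it

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    messages = [{"role": "user", "content": question}]
    first = client.chat.completions.create(
        model="minimax-m2.7-mxfp416", messages=messages, tools=TOOLS
    )
    reply = first.choices[0].message
    if not reply.tool_calls:
        return reply.content  # the model answered directly
    messages.append(reply)
    for call in reply.tool_calls:
        # Feed each tool result back, keyed by the id the model assigned.
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool(call.function.name, call.function.arguments),
        })
    final = client.chat.completions.create(
        model="minimax-m2.7-mxfp416", messages=messages, tools=TOOLS
    )
    return final.choices[0].message.content
```

Calling `ask_with_tools("What is the weather in Paris?")` against a running server would perform up to two round trips: one in which the model requests the tool, and one in which it answers using the tool result.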

6. Chat Template

The model uses a Jinja chat template supporting system messages, tool calls (<minimax:tool_call>/</minimax:tool_call>), reasoning content (<think>/</think>), and tool responses (<response>).

from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416",
    device_map="auto", dtype="auto", trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
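When generating locally rather than through a server with `--reasoning-parser`, the reasoning block may need to be separated from the final answer by hand. A minimal sketch, assuming the `<think>`/`</think>` tags survive decoding and that the opening tag may already have been emitted as part of the generation prompt (the helper name is hypothetical):

```python
def split_reasoning(text):
    """Return (reasoning, answer) from raw generated text."""
    if "</think>" not in text:
        # No reasoning block found: treat everything as the answer.
        return "", text.strip()
    head, _, answer = text.partition("</think>")
    # The opening tag is optional because the template may pre-fill it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


reasoning, answer = split_reasoning("<think>Greet the user.</think>Hello!")
print(answer)  # → Hello!
```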

7. Inference Parameters

temperature: 1.0
top_p: 0.95
top_k: 40
max_tokens: 16384 (default)
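Of these parameters, `top_k` is not part of the OpenAI chat schema, so with the OpenAI client it has to travel in `extra_body`, which vLLM reads as an additional sampling parameter. A small sketch of building the request arguments (the `RECOMMENDED` dictionary and `sampling_kwargs` helper are illustrative names, not part of any library):

```python
RECOMMENDED = {"temperature": 1.0, "top_p": 0.95, "top_k": 40, "max_tokens": 16384}


def sampling_kwargs(params=RECOMMENDED):
    """Split parameters into standard OpenAI arguments and vLLM extras."""
    standard = {k: v for k, v in params.items() if k != "top_k"}
    # top_k is passed through extra_body because the OpenAI client rejects it as a keyword.
    return {**standard, "extra_body": {"top_k": params["top_k"]}}
```

The result can be unpacked into a request, for example `client.chat.completions.create(model="minimax-m2.7-mxfp416", messages=messages, **sampling_kwargs())`.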

8. Acknowledgments

Base model: MiniMaxAI/MiniMax-M2.7
Quantization inspiration: tcclaviger/Step-3.7-Flash-240REAP-MXFP416
Runtime: tcclaviger/vllm22

9. License

Apache 2.0 — inherits from base model.


Safetensors

Model size: 130B params
Tensor type: BF16, F16
