---
license: other
license_name: minimax-model-license
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.7
tags:
  - rotorquant
  - kv-cache-quantization
  - minimax
  - m2.7
  - moe
  - quantized
library_name: transformers
pipeline_tag: text-generation
---

# MiniMax-M2.7-RotorQuant

**RotorQuant KV cache compression** for [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7).

This is a **documentation repository** that explains how to combine MiniMax-M2.7's weights with RotorQuant inference-time KV cache compression. No weights are stored here; use the base model directly and apply RotorQuant via the Python package or the llama.cpp fork.

## Hardware compatibility

| Device | VRAM / RAM | Recommendation |
| --- | --- | --- |
| Any host that runs the base model | Base model requirement, less the runtime KV-cache savings | RotorQuant/TurboQuant is a KV-cache runtime modifier; pair with any weight variant |

## What is this?

KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression.

| Technique | Where it's applied | Savings |
|-----------|-------------------|---------|
| Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
| **RotorQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |

Both can be combined for maximum efficiency.

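
To see why attention memory becomes critical at long context, the arithmetic can be sketched as follows. The layer count, KV head count, and head dimension below are hypothetical placeholders, not MiniMax-M2.7's actual configuration, and the estimate ignores per-block scales and other quantization metadata.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size  # 2 = K and V
    return values * bits_per_value / 8


# Hypothetical shape for illustration only, NOT MiniMax-M2.7's real config.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

for bits in (16, 4, 2):
    gib = kv_cache_bytes(bits_per_value=bits, **shape) / 2**30
    print(f"{bits:>2}-bit KV cache: {gib:.1f} GiB")
```

The cache grows linearly with sequence length, so the saving from fewer bits per value scales with context size, while the weight memory stays fixed.
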
## Quickstart

### Option A: Python / transformers

Install the `rotorquant` package:

```bash
pip install rotorquant
```

Then use it with the base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from rotorquant import IsoQuantCache

# Load the base model; this repository ships no weights of its own
tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M2.7", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-M2.7",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply RotorQuant to the KV cache
cache = IsoQuantCache(bits=4)  # or bits=2 for more aggressive compression

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=cache,
    use_cache=True,
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

### Option B: llama.cpp / LM Studio / Ollama (with fork)

RotorQuant KV cache types (`iso3`) are **not** in upstream llama.cpp. They require:

- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)

Once built:

```bash
llama-cli -m MiniMax-M2.7.gguf \
  --cache-type-k iso3 --cache-type-v iso3 \
  -ngl 99 -fa \
  -p "Hello"
```

For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the RotorQuant-specific benefits but keep GGUF weight quantization.

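
As a rough illustration of the fallback described above, the following sketch builds the same `llama-cli` command line with either the fork-only `iso3` type or the conventional `q8_0` type. The helper name `llama_cli_command` is hypothetical, and the flags are copied from the command in this section; flag syntax (for example for flash attention) may differ between llama.cpp builds, so check `llama-cli --help` for yours.

```python
import shlex


def llama_cli_command(model_path, prompt, use_fork):
    """Build a llama-cli command line (hypothetical helper).

    iso3 is only available in the llama-cpp-turboquant fork;
    q8_0 is a conventional cache type for standard runtimes.
    """
    cache_type = "iso3" if use_fork else "q8_0"
    args = [
        "llama-cli", "-m", model_path,
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
        "-ngl", "99", "-fa",  # flags mirrored from the command above
        "-p", prompt,
    ]
    return shlex.join(args)


print(llama_cli_command("MiniMax-M2.7.gguf", "Hello", use_fork=False))
```

Keeping the cache type as the only difference makes it easy to compare the two runtimes on the same GGUF file.
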
## Model Specifications

| Property | Value |
|----------|-------|
| Base Model | [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) |
| Architecture | Sparse MoE (256 experts, 8 active) |
| Parameters | ~456B total (MoE) |
| Context Length | 256K tokens |
| BF16 Size | ~912 GB |
| Modalities | Text |
| License | other ([minimax-model-license](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE)) |

## What is RotorQuant?

[RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors, presented as a faster, more parameter-efficient alternative to Google's TurboQuant. It uses lightweight block-diagonal rotations (independent 2D/4D rotations per pair/quartet of elements), achieving O(d) complexity instead of O(d log d), and it is fully parallelisable with no inter-element dependencies.

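
The block-diagonal idea can be illustrated with a minimal pure-Python sketch: rotate each pair of elements independently, quantize, then invert both steps. This is not RotorQuant's actual kernel or its Clifford rotor parameterization; the angles, the uniform quantizer, and all function names here are assumptions chosen only to show why the rotation costs O(d).

```python
import math


def rotate_pairs(vec, angles, inverse=False):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by its own angle.

    One 2x2 rotation per pair gives d/2 independent blocks, so the cost is O(d).
    """
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [c * x - s * y, s * x + c * y]
    return out


def quantize(vec, bits):
    """Uniform symmetric quantization (bits >= 2): integer codes plus one scale."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vec) / levels or 1.0  # avoid a zero scale
    return [round(v / scale) for v in vec], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


key = [0.9, -0.1, 0.05, 0.4, -0.7, 0.2, 0.0, 0.3]       # toy key vector, d = 8
angles = [0.3 * (i + 1) for i in range(len(key) // 2)]  # arbitrary illustrative angles

rotated = rotate_pairs(key, angles)
codes, scale = quantize(rotated, bits=4)
restored = rotate_pairs(dequantize(codes, scale), angles, inverse=True)

max_err = max(abs(a - b) for a, b in zip(key, restored))
print(f"max reconstruction error at 4 bits: {max_err:.3f}")
```

Because every pair is handled on its own, the loop body has no dependency on other pairs, which is what makes this style of rotation straightforward to parallelise.
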
**Benchmarks** (from the RotorQuant repository, Llama 3.1 8B on RTX 5090; results vary by model and hardware):

- Prefill: 3,822 tok/s (vs TurboQuant 722 tok/s)
- Decode: 119 tok/s (vs TurboQuant 93 tok/s)
- Perplexity: 6.91 (vs TurboQuant 7.07)
- Parameters: 4 per rotor (vs TurboQuant 16,384)

> These figures were not measured on MiniMax-M2.7, and performance on it will differ. Please open a discussion if you have independent results.

## Current Ecosystem Support

| Runtime | RotorQuant Support | Notes |
|---------|----------------------|-------|
| Python transformers + `rotorquant` | Full | Drop-in cache class |
| llama.cpp upstream | Not merged | Use fork below |
| llama-cpp-turboquant fork | Supported: `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
| LM Studio | [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
| Ollama | Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
| vLLM | Not supported | n/a |
| koboldcpp | Not supported | n/a |

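
For the Ollama fallback in the table, a minimal sketch of launching the server with a quantized KV cache might look like this. The helper name `ollama_env` is hypothetical, and setting `OLLAMA_FLASH_ATTENTION=1` alongside the cache type is an assumption based on Ollama's documentation, not something stated in this card.

```python
import os


def ollama_env(kv_cache_type="q8_0", base_env=None):
    """Return an environment for `ollama serve` with a quantized KV cache
    (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    # Assumption: Ollama expects flash attention to be enabled for
    # quantized KV cache types.
    env["OLLAMA_FLASH_ATTENTION"] = "1"
    return env


# Usage (requires a local Ollama install):
#   import subprocess
#   subprocess.run(["ollama", "serve"], env=ollama_env(), check=True)
print(ollama_env(base_env={}))
```

This gives conventional `q8_0` cache quantization only; the RotorQuant cache types remain unavailable in Ollama.
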
## Pre-quantized weight variants

If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:

- [MLX (Apple Silicon)](https://huggingface.co/majentik?search=MiniMax-M2.7+MLX)
- [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=MiniMax-M2.7+GGUF)

## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [Base model: MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)

## Variants in this family

(Showing 12 sibling variants under `majentik/minimax-m2.7-*`. The current variant, `RotorQuant`, is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| **RotorQuant** | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/minimax-m2.7-rotorquant-mlx-2bit) | mlx-lm | ~885 MB | Apple Silicon, smallest |
| [RotorQuant-MLX-3bit](https://huggingface.co/majentik/minimax-m2.7-rotorquant-mlx-3bit) | mlx-lm | ~1.2 GB | Apple Silicon, small |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/minimax-m2.7-rotorquant-mlx-4bit) | mlx-lm | ~1.7 GB | Apple Silicon, balanced |
| [RotorQuant-MLX-5bit](https://huggingface.co/majentik/minimax-m2.7-rotorquant-mlx-5bit) | mlx-lm | ~2.1 GB | Apple Silicon, higher fidelity |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/minimax-m2.7-rotorquant-mlx-8bit) | mlx-lm | ~3.2 GB | Apple Silicon, reference |
| [TurboQuant](https://huggingface.co/majentik/minimax-m2.7-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/minimax-m2.7-turboquant-mlx-2bit) | mlx-lm | ~885 MB | Apple Silicon, smallest |
| [TurboQuant-MLX-3bit](https://huggingface.co/majentik/minimax-m2.7-turboquant-mlx-3bit) | mlx-lm | ~1.2 GB | Apple Silicon, small |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/minimax-m2.7-turboquant-mlx-4bit) | mlx-lm | ~1.7 GB | Apple Silicon, balanced |
| [TurboQuant-MLX-5bit](https://huggingface.co/majentik/minimax-m2.7-turboquant-mlx-5bit) | mlx-lm | ~2.1 GB | Apple Silicon, higher fidelity |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/minimax-m2.7-turboquant-mlx-8bit) | mlx-lm | ~3.2 GB | Apple Silicon, reference |