Instructions to use mlx-community/Tmax-27B-MLX-8bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use mlx-community/Tmax-27B-MLX-8bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Tmax-27B-MLX-8bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use mlx-community/Tmax-27B-MLX-8bit with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/Tmax-27B-MLX-8bit"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "mlx-community/Tmax-27B-MLX-8bit" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory: pi
- Hermes Agent
How to use mlx-community/Tmax-27B-MLX-8bit with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/Tmax-27B-MLX-8bit"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default mlx-community/Tmax-27B-MLX-8bit
Run Hermes
hermes
- MLX LM
How to use mlx-community/Tmax-27B-MLX-8bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "mlx-community/Tmax-27B-MLX-8bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm
# Start the server (listens on port 8080 by default)
mlx_lm.server --model "mlx-community/Tmax-27B-MLX-8bit"
# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "mlx-community/Tmax-27B-MLX-8bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
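The same chat-completions request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server is listening on port 8080, as the Pi and Hermes configurations above do, and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_request` and `chat`, and the `max_tokens` value, are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is reachable on localhost port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "mlx-community/Tmax-27B-MLX-8bit"


def build_chat_request(user_text, base_url=BASE_URL, model=MODEL_ID, max_tokens=128):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text):
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. If the server was started with a different `--port`, pass a matching `base_url` to `build_chat_request`.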
Tmax-27B MLX (8bit)
MLX-converted text-only weights of allenai/tmax-27b.
The upstream base ships as a multimodal Qwen3_5ForConditionalGeneration
config but contains zero vision tensors in its safetensors — i.e. it is
already a text-only checkpoint with stub vision metadata. This release
strips the residual vision_config / image-token entries so it loads
cleanly via mlx_lm without a vision tower.
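The "zero vision tensors" observation can be checked without loading any weights, because a safetensors file begins with an 8-byte little-endian header length followed by a JSON header that names every tensor. The sketch below reads that header and filters for names that look like vision-tower weights. The substrings `visual` and `vision` are an assumption about how such tensors would be named, not something taken from the upstream repo, and the function names are illustrative.

```python
import json
import struct


def read_tensor_names(path):
    """Return tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def find_vision_tensors(names, markers=("visual", "vision")):
    """Keep names containing any marker (the markers are an assumption)."""
    return [n for n in names if any(m in n for m in markers)]
```

Large checkpoints are usually split across several shard files, so a full check would run `read_tensor_names` on each shard and expect `find_vision_tensors` to return an empty list every time.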
- Source: allenai/tmax-27b
- License: Apache-2.0
- Variant: 8bit
- Quantized by: raullenchai
- Tooling: mlx-lm 0.31.3 (the upstream mlx_vlm 0.3.12 qwen3_5 loader hard-requires vision-tower weights that this base does not ship, so the text-only mlx_lm.convert path is used instead)
- Chat template: ships with the source repo (chat_template.jinja)
- Tool format: qwen3_xml-compatible (<tool_call>{json}</tool_call>)
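For clients that consume raw generated text, the tool format listed above can be parsed with a small helper. This is a sketch that assumes each call is a single JSON object wrapped in `<tool_call>` tags, as the list describes. The `name` and `arguments` keys in the sample are a common convention and an assumption here, and `extract_tool_calls` is an illustrative helper, not part of mlx-lm.

```python
import json
import re

# Assumption: every call is one JSON object between <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return the decoded JSON object of each well-formed tool call in text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip malformed payloads instead of failing the whole response.
            continue
    return calls


sample = (
    "Let me check.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
calls = extract_tool_calls(sample)
```

Here `calls` holds one dictionary whose `name` is `get_weather`. Malformed payloads are skipped silently, which suits a demo; a production client would more likely log or surface them.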
Usage
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Tmax-27B-MLX-8bit")
print(generate(model, tokenizer, prompt="Hello", max_tokens=32))
Notes
- This is a pure text-generation MLX release. No vision/image inputs.
- For best chat behavior, use the chat template that ships with this repo.
Benchmarks
Measured on an M3 Ultra Mac Studio (28-core CPU with 20 performance and 8 efficiency cores, 60-core GPU, 256 GB unified memory) via rapid-mlx 0.8.18. All values are medians of 3 runs.
| Variant | Decode tok/s | TTFT (ms) | Prefill 1k (tok/s) | Prefill 4k (tok/s) | Prefill 16k (tok/s) | Tool-call e2e |
|---|---|---|---|---|---|---|
| Tmax-27B (8-bit MLX) | 22.1 | 301 | 308 | 319 | 308 | 2681 ms (OK) |
Architecture note: Tmax-27B uses a hybrid Gated-DeltaNet design (3:1 linear-attention to full-attention layer mix). 16k-context prefill is bandwidth-bound at ~310 tok/s regardless of quantization bit width — ~53 s wall to first token at 16k. This is an architectural property of hybrid linear-attention models on Apple Silicon, not a regression, and not a rapid-mlx bug. Decode and short-context (≤4k) tool-call performance are competitive with the dense Qwen3.5-27B-4bit control on the same hardware.
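The 53 s figure follows directly from the prefill throughput in the table. As a rough worked example, the sketch below divides prompt length by the measured prefill rate, assuming "16k" means 16,384 tokens and that prefill time dominates time to first token at long context. The function name is illustrative.

```python
def prefill_seconds(prompt_tokens, prefill_tok_per_s):
    """Estimate wall time spent on prefill before the first token appears."""
    return prompt_tokens / prefill_tok_per_s


# Prefill rates (tok/s) from the benchmark table for the 8-bit variant.
# Assumption: 1k, 4k, and 16k mean 1024, 4096, and 16384 tokens.
measured = {1024: 308, 4096: 319, 16384: 308}

estimates = {
    tokens: round(prefill_seconds(tokens, rate), 1)
    for tokens, rate in measured.items()
}
print(estimates)
```

Because the prefill rate stays near 310 tok/s at every tested length, the wait grows linearly with prompt length: about 3 s at 1k, 13 s at 4k, and 53 s at 16k.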
Full results (all 7 Tmax MLX variants + 2 Qwen3.5 controls): rapid-mlx docs.
Reproduce:
pip install rapid-mlx==0.8.18
rapid-mlx serve tmax-27b-8bit --port 8765