Instructions to use AxionML/Kimi-K2.5-MXFP8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use AxionML/Kimi-K2.5-MXFP8 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AxionML/Kimi-K2.5-MXFP8", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```
```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("AxionML/Kimi-K2.5-MXFP8", trust_remote_code=True, dtype="auto")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use AxionML/Kimi-K2.5-MXFP8 with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "AxionML/Kimi-K2.5-MXFP8"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "AxionML/Kimi-K2.5-MXFP8",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
Use Docker
```bash
docker model run hf.co/AxionML/Kimi-K2.5-MXFP8
```
- SGLang
How to use AxionML/Kimi-K2.5-MXFP8 with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "AxionML/Kimi-K2.5-MXFP8" \
  --host 0.0.0.0 \
  --port 30000
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "AxionML/Kimi-K2.5-MXFP8" \
    --host 0.0.0.0 \
    --port 30000
```
With either setup, call the server using curl (OpenAI-compatible API):
```bash
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "AxionML/Kimi-K2.5-MXFP8",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
- Docker Model Runner
How to use AxionML/Kimi-K2.5-MXFP8 with Docker Model Runner:
```bash
docker model run hf.co/AxionML/Kimi-K2.5-MXFP8
```
AxionML Kimi-K2.5-MXFP8
Developed by AxionML for open-source serving and deployment use cases. Part of AxionML's effort to provide ready-to-serve quantized models for the community.
This is an MXFP8-quantized version of moonshotai/Kimi-K2.5 (1T total parameters, 32B activated), quantized using NVIDIA TensorRT Model Optimizer. Weights and activations of linear layers are quantized to FP8, reducing disk size and GPU memory by ~2x compared to BF16.
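The ~2x figure can be checked with a rough calculation. The sketch below assumes the OCP MX layout (one 8-bit shared scale per 32-element block) and, as a simplification, treats all 1T parameters as quantized. Parts kept in original precision, such as the vision encoder, make the real checkpoint slightly larger than this estimate.

```python
# Back-of-the-envelope weight storage for a 1T-parameter model.
TOTAL_PARAMS = 1_000_000_000_000  # 1T, from the model card
BLOCK_SIZE = 32                   # assumption: OCP MX block size
BYTES_BF16 = 2
BYTES_FP8 = 1
BYTES_SCALE = 1                   # assumption: one 8-bit shared scale per block


def weight_bytes_bf16(n_params: int) -> int:
    """Bytes needed to store every parameter in BF16."""
    return n_params * BYTES_BF16


def weight_bytes_mxfp8(n_params: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes for FP8 elements plus one shared scale per block."""
    n_blocks = -(-n_params // block_size)  # ceiling division
    return n_params * BYTES_FP8 + n_blocks * BYTES_SCALE


bf16 = weight_bytes_bf16(TOTAL_PARAMS)
mxfp8 = weight_bytes_mxfp8(TOTAL_PARAMS)
print(f"BF16 : {bf16 / 1e12:.2f} TB")
print(f"MXFP8: {mxfp8 / 1e12:.2f} TB")
print(f"ratio: {bf16 / mxfp8:.2f}x")
```

Under these assumptions the block scales add about 3% on top of the FP8 elements, which is why the ratio lands a little below 2x.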
About MXFP8 quantization: MXFP8 (Microscaling FP8) uses the E4M3 format with per-block scaling factors to maintain accuracy while halving memory footprint. Unlike coarser per-tensor schemes, microscaling applies fine-grained scaling over small element groups, preserving dynamic range across layers with heterogeneous activation distributions. On NVIDIA Hopper and Blackwell GPUs, FP8 Tensor Cores deliver up to 2x the throughput of BF16 with negligible accuracy loss for well-calibrated models.
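To make the per-block scaling concrete, here is a minimal pure-Python sketch of microscaling for one block. It assumes the OCP MX conventions (32 elements per block, a power-of-two shared scale, E4M3 elements saturating at 448) and is an illustration only, not the TensorRT Model Optimizer implementation. The helper names are made up for this example.

```python
import math

BLOCK = 32          # assumption: elements sharing one scale (OCP MX)
E4M3_MAX = 448.0    # largest finite E4M3 value, 1.75 * 2**8
E4M3_EMAX = 8       # exponent of the largest E4M3 binade
E4M3_EMIN = -6      # exponent of the smallest normal E4M3 binade
MANT_BITS = 3       # explicit mantissa bits in E4M3


def floor_log2(a: float) -> int:
    """Exact floor(log2(a)) for a > 0."""
    return math.frexp(a)[1] - 1


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest value on the E4M3 grid, saturating at +-448."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), E4M3_MAX)
    e = max(floor_log2(a), E4M3_EMIN)   # below EMIN the grid is subnormal
    step = 2.0 ** (e - MANT_BITS)       # spacing of representable values
    return math.copysign(round(a / step) * step, v)


def quantize_block(block):
    """Return (shared_exponent, e4m3_values) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Pick the power-of-two scale that maps the largest magnitude
    # into the top E4M3 binade; larger scaled values saturate at 448.
    shared_exp = floor_log2(amax) - E4M3_EMAX
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(x / scale) for x in block]


def dequantize_block(shared_exp, elems):
    """Reconstruct approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elems]
```

Because each block of 32 values gets its own exponent, a block of small activations keeps roughly the same relative precision as a block of large ones, which a single per-tensor scale cannot guarantee.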
Ready for commercial and non-commercial use under Modified MIT.
Model Summary
| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers | 61 (including 1 dense layer) |
| Number of Experts | 384 routed, 1 shared, 8 selected per token |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |
| Vision Encoder | MoonViT (400M parameters) |
| Context Length | 256K |
| Vocabulary Size | 160K |
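
The "8 selected per token" entry means that each MoE layer runs only a small subset of its 384 routed experts for a given token, alongside the shared expert. The sketch below shows generic top-k routing with softmax weights over the selected experts. The exact gating function Kimi K2.5 uses (scoring, normalization, any bias terms) is not described in this card, so treat this as an assumption-laden illustration rather than the model's router.

```python
import math

NUM_ROUTED_EXPERTS = 384  # from the table above
TOP_K = 8                 # experts selected per token, from the table above


def route_token(logits, top_k=TOP_K):
    """Pick the top_k experts for one token and weight them by softmax.

    `logits` holds one router score per routed expert. The shared expert
    is always applied and therefore does not take part in routing.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]  # numerically stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Fraction of routed experts that run for any single token in one MoE layer.
active_fraction = TOP_K / NUM_ROUTED_EXPERTS
```

With 8 of 384 routed experts active, about 2% of the routed expert weights in a layer are used per token, which is how a 1T-parameter model ends up with 32B activated parameters.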
Evaluation Results (BF16 Baseline)
| Benchmark | Kimi K2.5 (Thinking) |
|---|---|
| Reasoning & Knowledge | |
| HLE-Full | 30.1 |
| HLE-Full (w/ tools) | 50.2 |
| AIME 2025 | 96.1 |
| HMMT 2025 (Feb) | 95.4 |
| IMO-AnswerBench | 81.8 |
| GPQA-Diamond | 87.6 |
| MMLU-Pro | 87.1 |
| Image & Video | |
| MMMU-Pro | 78.5 |
| CharXiv (RQ) | 77.5 |
| MathVision | 84.2 |
| MathVista (mini) | 90.1 |
| ZeroBench | 9 |
| Coding | |
| SWE-bench Verified | 65.4 |
| LiveCodeBench | 74.6 |
| Codeforces | 2131 |
| Agentic | |
| TAU-Bench (Airline) | 72.6 |
| TAU-Bench (Retail) | 68.4 |
| OSWorld (15 steps) | 41.2 |
| BrowserGym | 57.3 |
Scores are taken from the moonshotai/Kimi-K2.5 model card and reflect the BF16 baseline. MXFP8 quantization is expected to cause negligible accuracy degradation (<0.5%) on these benchmarks.
Quantization Details
This model was quantized by applying MXFP8 to the weights and activations of linear operators within transformer blocks. Vision encoder weights are kept in their original precision.
- Quantization format: MXFP8 (E4M3 with microscaling)
- Calibration dataset: Nemotron-Post-Training-Dataset-v2
- Tool: TensorRT Model Optimizer
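
The selection rule described above (quantize linear operators inside transformer blocks, leave the vision encoder alone) can be pictured as pattern matching over module names. The module names and patterns below are hypothetical, chosen only for illustration. The real checkpoint's names and the tool's actual configuration may differ.

```python
from fnmatch import fnmatch

# Hypothetical module names; the real checkpoint's names may differ.
MODULES = [
    "language_model.layers.0.self_attn.q_proj",
    "language_model.layers.0.mlp.down_proj",
    "language_model.layers.1.mlp.experts.0.up_proj",
    "language_model.lm_head",
    "vision_tower.blocks.0.attn.qkv",
]

# Ordered rules: the first matching pattern decides.
# None means "keep the original precision".
RULES = [
    ("vision_tower.*", None),              # vision encoder stays untouched
    ("language_model.layers.*", "mxfp8"),  # linears inside transformer blocks
]


def precision_plan(names, rules=RULES):
    """Map each module name to its quantization format, or None."""
    plan = {}
    for name in names:
        plan[name] = next((fmt for pat, fmt in rules if fnmatch(name, pat)), None)
    return plan


for name, fmt in precision_plan(MODULES).items():
    print(f"{name}: {fmt or 'original precision'}")
```

Modules that match no rule, such as the hypothetical `lm_head` here, fall through to original precision because they sit outside the transformer blocks.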
Usage
Deploy with SGLang
```bash
python3 -m sglang.launch_server \
  --model-path AxionML/Kimi-K2.5-MXFP8 \
  --tp 8 \
  --trust-remote-code
```
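
Once the server is up, it can be queried from Python through its OpenAI-compatible endpoint. The sketch below uses only the standard library and assumes the server is reachable on localhost at SGLang's default port 30000. The helper names `build_chat_request` and `chat` are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumption: local server on SGLang's default port
MODEL_ID = "AxionML/Kimi-K2.5-MXFP8"


def build_chat_request(prompt, image_url=None, base_url=BASE_URL):
    """Build a POST request for the /v1/chat/completions endpoint."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, image_url=None):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(prompt, image_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a running server, `chat("Describe this image in one sentence.", image_url="<image URL>")` would return the reply as a string. Building the request separately from sending it makes the payload easy to inspect without a live server.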
Reproduce with ModelOpt
```bash
python3 examples/llm_ptq/hf_ptq.py \
  --pyt_ckpt_path moonshotai/Kimi-K2.5 \
  --qformat mxfp8 \
  --export_path ./kimi-k2.5-mxfp8
```
Limitations
The base model was trained on data that may contain toxic language and societal biases. The quantized model inherits these limitations. It may generate inaccurate, biased, or offensive content. Please refer to the original model card for full details.
Model tree for AxionML/Kimi-K2.5-MXFP8
Base model
moonshotai/Kimi-K2.5