Kimi-K2.7-Code-NVFP4
๐ Available now โ try this model at cogito.decart.ai and via OpenRouter.
Description:
NVFP4 quantized version of moonshotai/Kimi-K2.7-Code, quantized with NVIDIA Model Optimizer. The routed-expert linear layers are quantized to NVFP4 (4-bit float, block size 16) with an FP8 KV cache; attention (MLA), shared experts, the layer-0 dense MLP, lm_head, and the vision tower / mm_projector remain BF16 โ the same precision split as nvidia/Kimi-K2.6-NVFP4. Ready for inference with vLLM on NVIDIA Blackwell.
This is a community reproduction produced by Decart; it is not affiliated with or endorsed by NVIDIA or Moonshot AI.
Third-Party Community Consideration
This model is not owned or developed by NVIDIA or Decart's base-model providers. It is built to a third-party's requirements; see the non-Decart Kimi-K2.7-Code Model Card.
License/Terms of Use:
Use of this model is governed by the license of the base model, moonshotai/Kimi-K2.7-Code (Modified MIT).
Model Architecture:
Architecture Type: Transformers
Network Architecture: DeepSeek-V3 (MLA attention, 384 routed experts + 1 shared expert, 61 layers), wrapped with a vision tower + mm_projector (KimiK25ForConditionalGeneration)
Number of Model Parameters: ~1T total / ~32B activated
Input:
Input Type(s): Text (Image/Video per base model)
Input Format(s): String
Other Properties Related to Input: Long context per base model
Output:
Output Type(s): Text
Output Format: String
Software Integration:
Supported Runtime Engine(s): vLLM
Supported Hardware Microarchitecture Compatibility: NVIDIA Blackwell
Preferred Operating System(s): Linux
Post Training Quantization
This model was obtained by converting and quantizing the routed-expert weights and activations of Kimi-K2.7-Code from its native INT4 (compressed-tensors pack-quantized) โ BF16 โ NVFP4, ready for inference with vLLM. Only the weights and activations of the linear operators within the MoE transformer blocks are quantized.
- Recipe:
general/ptq/nvfp4_experts_only_mse-kv_fp8_cast(MSE-static weight scales via FP8 scale sweep; dynamic NVFP4 input scales; FP8 KV cache). - Group size: 16. KV cache: FP8.
- Quantized modules: 69,120 = 60 layers ร 384 experts ร 3 projections (
gate_proj,up_proj,down_proj). - The
exclude_modules/quantization_configmatchnvidia/Kimi-K2.6-NVFP4.
Calibration Dataset:
cnn_dailymail + Nemotron-Post-Training-Dataset-v2; calib_size=512, calib_seq=512. Full expert coverage was achieved during calibration (0 experts required the max-based amax backstop).
Usage
To serve this checkpoint with vLLM (vllm/vllm-openai:latest):
VLLM_USE_FLASHINFER_MOE_FP4=1 python3 -m vllm.entrypoints.openai.api_server \
--model decart-ai/Kimi-K2.7-Code-NVFP4 \
--tensor-parallel-size 8 \
--kv-cache-dtype fp8 \
--tool-call-parser kimi_k2 --enable-auto-tool-choice \
--trust-remote-code
Validated load + generation on NVIDIA B300 with vLLM 0.23.
Reproducibility Notes
- Quantized with NVIDIA Model Optimizer (commit
cc17f2c). - Loading the base INT4 checkpoint requires
compressed-tensors==0.14.0.x(CompressedLinearwas deprecated in 0.15+). - Export was run with
CUDA_LAUNCH_BLOCKING=1to avoid an asynchronous CUDA fault in the decompress/quantize path observed on Blackwell.
Evaluation
Coding-focused evaluation against the INT4 baseline is pending and will be added.
Model Limitations
The base model was trained on internet data that may contain toxic language and societal biases; the model may reflect these and may generate inaccurate, incomplete, or otherwise undesirable output. Validate with use-case-specific testing before deployment.
- Downloads last month
- 508
Model tree for decart-ai/Kimi-K2.7-Code-NVFP4
Base model
moonshotai/Kimi-K2.7-Code