Kimi-K2.7-Code-NVFP4

๐Ÿš€ Available now โ€” try this model at cogito.decart.ai and via OpenRouter.

Description:

NVFP4 quantized version of moonshotai/Kimi-K2.7-Code, quantized with NVIDIA Model Optimizer. The routed-expert linear layers are quantized to NVFP4 (4-bit float, block size 16) with an FP8 KV cache; attention (MLA), shared experts, the layer-0 dense MLP, lm_head, and the vision tower / mm_projector remain BF16 โ€” the same precision split as nvidia/Kimi-K2.6-NVFP4. Ready for inference with vLLM on NVIDIA Blackwell.

This is a community reproduction produced by Decart; it is not affiliated with or endorsed by NVIDIA or Moonshot AI.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA or Decart's base-model providers. It is built to a third-party's requirements; see the non-Decart Kimi-K2.7-Code Model Card.

License/Terms of Use:

Use of this model is governed by the license of the base model, moonshotai/Kimi-K2.7-Code (Modified MIT).

Model Architecture:

Architecture Type: Transformers
Network Architecture: DeepSeek-V3 (MLA attention, 384 routed experts + 1 shared expert, 61 layers), wrapped with a vision tower + mm_projector (KimiK25ForConditionalGeneration)
Number of Model Parameters: ~1T total / ~32B activated

Input:

Input Type(s): Text (Image/Video per base model)
Input Format(s): String
Other Properties Related to Input: Long context per base model

Output:

Output Type(s): Text
Output Format: String

Software Integration:

Supported Runtime Engine(s): vLLM
Supported Hardware Microarchitecture Compatibility: NVIDIA Blackwell
Preferred Operating System(s): Linux

Post Training Quantization

This model was obtained by converting and quantizing the routed-expert weights and activations of Kimi-K2.7-Code from its native INT4 (compressed-tensors pack-quantized) โ†’ BF16 โ†’ NVFP4, ready for inference with vLLM. Only the weights and activations of the linear operators within the MoE transformer blocks are quantized.

  • Recipe: general/ptq/nvfp4_experts_only_mse-kv_fp8_cast (MSE-static weight scales via FP8 scale sweep; dynamic NVFP4 input scales; FP8 KV cache).
  • Group size: 16. KV cache: FP8.
  • Quantized modules: 69,120 = 60 layers ร— 384 experts ร— 3 projections (gate_proj, up_proj, down_proj).
  • The exclude_modules / quantization_config match nvidia/Kimi-K2.6-NVFP4.

Calibration Dataset:

cnn_dailymail + Nemotron-Post-Training-Dataset-v2; calib_size=512, calib_seq=512. Full expert coverage was achieved during calibration (0 experts required the max-based amax backstop).

Usage

To serve this checkpoint with vLLM (vllm/vllm-openai:latest):

VLLM_USE_FLASHINFER_MOE_FP4=1 python3 -m vllm.entrypoints.openai.api_server \
  --model decart-ai/Kimi-K2.7-Code-NVFP4 \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --tool-call-parser kimi_k2 --enable-auto-tool-choice \
  --trust-remote-code

Validated load + generation on NVIDIA B300 with vLLM 0.23.

Reproducibility Notes

  • Quantized with NVIDIA Model Optimizer (commit cc17f2c).
  • Loading the base INT4 checkpoint requires compressed-tensors==0.14.0.x (CompressedLinear was deprecated in 0.15+).
  • Export was run with CUDA_LAUNCH_BLOCKING=1 to avoid an asynchronous CUDA fault in the decompress/quantize path observed on Blackwell.

Evaluation

Coding-focused evaluation against the INT4 baseline is pending and will be added.

Model Limitations

The base model was trained on internet data that may contain toxic language and societal biases; the model may reflect these and may generate inaccurate, incomplete, or otherwise undesirable output. Validate with use-case-specific testing before deployment.

Downloads last month
508
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for decart-ai/Kimi-K2.7-Code-NVFP4

Quantized
(21)
this model