Phi-4-multimodal-instruct W4A16 GPTQ

GPTQ W4A16 quantization of microsoft/Phi-4-multimodal-instruct, a 5.6 B parameter multimodal model by Microsoft supporting text, vision (images), and audio inputs.

Quantized with llm-compressor on an RTX 5090. Weights are stored in compressed-tensors format, which vLLM loads natively.

License: MIT, © Microsoft Corporation. This quantization carries the same MIT license as the original model.


Why this quantization?

| | bf16 safetensors | GGUF (Q4_K_M + mmproj) | This model (W4A16 GPTQ) |
|---|---|---|---|
| Size | ~14 GB | 2.37 GB + 825 MB | ~5–6 GB |
| Text | ✅ | ✅ | ✅ |
| Vision (images) | ✅ | ✅ | ✅ |
| Audio / Speech | ✅ | ❌ | ✅ |
| Serves with | vLLM | llama.cpp / LM Studio | vLLM |
| Quantization method | none | GGML int4 | GPTQ int4 (W4A16) |

The GGUF files in Swicked86/phi4-mm-gguf are smaller but lack audio. This model keeps all three modalities at roughly 40% of the bf16 size (~5–6 GB vs ~14 GB).


Available Files

| File | Size | Notes |
|---|---|---|
| model-00001-of-00002.safetensors | ~3 GB | Quantized weight shard 1 |
| model-00002-of-00002.safetensors | ~2 GB | Quantized weight shard 2 |
| config.json | - | Includes quantization_config, which vLLM auto-detects |
| tokenizer.model / tokenizer.json | - | Tokenizer |
| preprocessor_config.json | - | Vision + audio processor config (bf16 encoders) |

The SigLIP-400M vision encoder and conformer-based speech encoder are stored at full bfloat16 precision; only the Phi3 text transformer weights (32 decoder layers) are quantized to int4.


VRAM Requirements

Model weights occupy 6.11 GiB (measured). The remaining VRAM is used by the KV cache: vLLM pre-allocates the full KV cache pool at startup based on --gpu-memory-utilization and --max-model-len. Total VRAM allocated = weights + pre-allocated KV cache, regardless of how many requests are active.

By GPU tier

| GPU | VRAM | Recommended --max-model-len | --gpu-memory-utilization | Notes |
|---|---|---|---|---|
| RTX 3070 / 2080 Super | 8 GB | - | - | ⚠️ Not recommended. Weights alone are 6.1 GB; insufficient headroom for KV cache. |
| RTX 3080 10 GB / 2080 Ti | 10 GB | 16,384 | 0.85 | Minimum viable. Tight; use lowest context only. |
| RTX 3080 12 GB / 4070 | 12 GB | 16,384–32,768 | 0.85 | Comfortable at 16K; 32K fits with care. |
| RTX 3080 Ti / 4070 Ti / 4080 | 16 GB | 32,768–65,520 | 0.85–0.90 | Good balance of context and headroom. |
| RTX 3090 / 4090 / 4080 Super | 24 GB | 65,520 | 0.85–0.90 | Recommended. Full tested context, comfortable. |
| RTX 5090 / A6000 / A100 40 GB | 32+ GB | 65,520–131,072 | 0.45–0.90 | Plenty of headroom; lower utilization keeps VRAM free for other tasks. |

By context length

| --max-model-len | Weights | KV cache (est.) | Total (est.) | Min GPU VRAM |
|---|---|---|---|---|
| 16,384 | ~6.1 GB | ~1.5 GB | ~8 GB | 10 GB |
| 32,768 | ~6.1 GB | ~3.0 GB | ~10 GB | 12 GB |
| 65,520 | ~6.1 GB | ~6.0 GB | ~13 GB | 16 GB |
| 131,072 (max, untested) | ~6.1 GB | ~12.0 GB | ~19 GB | 24 GB |

KV cache estimates use Phi-4-Mini architecture (32 layers, 8 KV heads, head_dim 96, bf16 activations ≈ 96 KB/token). Add ~1–2 GB for framework overhead. Weights measured on RTX 5090 with vLLM.
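The per-token figure can be reproduced with a few lines of arithmetic. This is a minimal sketch that takes the architecture numbers quoted above at face value (32 layers, 8 KV heads, head_dim 96, 2 bytes per bf16 value); the function names are illustrative, and the result covers a single sequence filling the whole context, before any framework overhead.

```python
def kv_cache_bytes_per_token(num_layers: int = 32,
                             num_kv_heads: int = 8,
                             head_dim: int = 96,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(max_model_len: int) -> float:
    """KV cache size in GiB for one sequence of max_model_len tokens."""
    return max_model_len * kv_cache_bytes_per_token() / 2**30


print(kv_cache_bytes_per_token() // 1024, "KiB per token")
for context in (16_384, 32_768, 65_520, 131_072):
    print(f"{context:>7} tokens: {kv_cache_gib(context):.2f} GiB")
```

If the model's actual head_dim or KV head count differs from the values quoted above, only the defaults need to change.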

Why does vLLM show higher usage than "total est." above? vLLM pre-allocates the entire KV cache pool at startup. On a large GPU (e.g. 32 GB at --gpu-memory-utilization 0.45), it budgets 0.45 × 32 GB ≈ 14.4 GB for weights plus KV cache, so roughly 8 GB goes to the KV cache pool even if no requests are active. The table above shows the minimum needed, not what vLLM will allocate when given more headroom.
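As a rough illustration of that budget, the sketch below treats --gpu-memory-utilization as the fraction of VRAM shared by weights and KV cache (as the flag reference in this card describes it) and subtracts the measured weight size. The function name and the overhead parameter are hypothetical, and GB and GiB are treated as interchangeable for this estimate.

```python
def kv_pool_estimate(total_vram: float,
                     gpu_memory_utilization: float,
                     weights: float = 6.11,
                     overhead: float = 0.0) -> float:
    """Approximate size of the pre-allocated KV cache pool.

    All sizes share one unit (GB or GiB). `overhead` is a hypothetical knob
    for activation and framework memory; it defaults to zero.
    """
    budget = total_vram * gpu_memory_utilization
    return max(budget - weights - overhead, 0.0)


# 32 GB card at 0.45 utilization: about 14.4 total, about 8.3 left for KV cache.
print(round(kv_pool_estimate(32, 0.45), 2))
# 10 GB card at 0.85 utilization: about 8.5 total, about 2.4 left for KV cache.
print(round(kv_pool_estimate(10, 0.85), 2))
```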


Usage

Step 1: Install vLLM

Requirements: Python 3.10+, CUDA GPU with ≥ 10 GB VRAM (8 GB cards lack KV cache headroom, per the VRAM tables above), vLLM 0.9.0+

pip install vllm

Do not install auto-gptq or pass --quantization gptq. This model uses compressed-tensors format, which vLLM handles automatically from config.json.
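To confirm the installed vLLM meets the stated 0.9.0 minimum, something like the following can be run in the target environment. It is a small sketch using only the standard library; the helper names are illustrative and the version parser deliberately ignores suffixes such as rc, post, or local build tags.

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Reduce a version string to (major, minor, patch), ignoring suffixes."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def vllm_is_recent_enough(minimum: str = "0.9.0") -> bool:
    """True if vLLM is installed and at least `minimum`."""
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= parse_version(minimum)


print("vLLM OK" if vllm_is_recent_enough() else "vLLM missing or older than 0.9.0")
```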


Step 2: Download the model

huggingface-cli download Swicked86/phi4-mm-gptq --local-dir ./phi4-mm-gptq

Or let vLLM download on first run by passing the repo ID directly (see Step 3).
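The same download can be done from Python, followed by a quick check that the files this card refers to are present. This is a sketch: `snapshot_download` is the huggingface_hub function, while `download_model`, `missing_entries`, and the `EXPECTED` list are illustrative names, and the presence of speech-lora and vision-lora as subdirectories is assumed from the launch commands in Step 3.

```python
from pathlib import Path

# Names taken from the file table and the Step 3 launch commands in this card.
EXPECTED = (
    "config.json",
    "preprocessor_config.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
    "speech-lora",
    "vision-lora",
)


def download_model(repo_id: str = "Swicked86/phi4-mm-gptq",
                   local_dir: str = "./phi4-mm-gptq") -> str:
    """Download the repo snapshot and return the local path."""
    # Imported lazily so the check below works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


def missing_entries(model_dir: str) -> list:
    """Return the expected files or directories that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


# Typical use (requires network access):
#   path = download_model()
#   print(missing_entries(path) or "all expected files present")
```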


Step 3: Launch the vLLM server

Minimum working command (text + vision + audio, 10 GB+ GPU):

python -m vllm.entrypoints.openai.api_server \
  --model ./phi4-mm-gptq \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85 \
  --enable-lora \
  --max-lora-rank 320 \
  --lora-modules speech=./phi4-mm-gptq/speech-lora \
                 vision=./phi4-mm-gptq/vision-lora \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm

Extended context command (16 GB+ GPU, 65K context):

python -m vllm.entrypoints.openai.api_server \
  --model ./phi4-mm-gptq \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 65520 \
  --gpu-memory-utilization 0.90 \
  --enable-lora \
  --max-lora-rank 320 \
  --lora-modules speech=./phi4-mm-gptq/speech-lora \
                 vision=./phi4-mm-gptq/vision-lora \
  --limit-mm-per-prompt '{"image": 3, "audio": 3}' \
  --enable-auto-tool-choice \
  --tool-call-parser phi4_mini_json \
  --port 8080 \
  --host 127.0.0.1 \
  --served-model-name phi4-mm

Flag reference:

| Flag | Value | Why |
|---|---|---|
| --model | path or Swicked86/phi4-mm-gptq | Local dir or HF repo ID |
| --dtype bfloat16 | required | Model native dtype; do not use float16 |
| --trust-remote-code | required | phi4-mm uses custom modeling code |
| --max-model-len | 16384–131072 | See VRAM table above. 65,520 = 16 × 4,095 (a multiple of vLLM's 16-token block size). Beyond 65,520 is untested with this quantization; proceed at your own risk |
| --gpu-memory-utilization | 0.45–0.90 | Fraction of GPU VRAM to reserve for weights + KV cache |
| --enable-lora | required for vision/audio | Activates the rank-320 LoRA adapters |
| --max-lora-rank 320 | required | phi4-mm LoRAs are rank 320 (unusually large) |
| --lora-modules | speech=... vision=... | Points to the adapter subdirs; enables those modalities |
| --limit-mm-per-prompt | {"image":3,"audio":3} | Max attachments per message |
| --tool-call-parser phi4_mini_json | optional | phi4-mm emits functools[...] format; this parses it |
| --served-model-name phi4-mm | optional | Alias so clients use "model": "phi4-mm" |

No --quantization flag needed. vLLM reads quantization_config from config.json and activates the compressed-tensors int4 kernels automatically.
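When launching from a script, the two commands above can be generated from a handful of parameters. The sketch below only assembles the flags already listed in this section; `build_vllm_command` and its parameter names are illustrative, not part of vLLM.

```python
import json
import shlex


def build_vllm_command(model_dir: str = "./phi4-mm-gptq",
                       max_model_len: int = 16384,
                       gpu_memory_utilization: float = 0.85,
                       port: int = 8080,
                       tool_calling: bool = False) -> list:
    """Assemble the server launch command as an argv list."""
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_dir,
        "--dtype", "bfloat16",
        "--trust-remote-code",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-lora",
        "--max-lora-rank", "320",
        "--lora-modules",
        f"speech={model_dir}/speech-lora",
        f"vision={model_dir}/vision-lora",
        "--limit-mm-per-prompt", json.dumps({"image": 3, "audio": 3}),
        "--port", str(port),
        "--host", "127.0.0.1",
        "--served-model-name", "phi4-mm",
    ]
    if tool_calling:
        cmd += ["--enable-auto-tool-choice", "--tool-call-parser", "phi4_mini_json"]
    return cmd


# Print a shell-safe version of the extended-context command.
print(shlex.join(build_vllm_command(max_model_len=65520,
                                    gpu_memory_utilization=0.90,
                                    tool_calling=True)))
```

The returned list can be passed directly to `subprocess.Popen` to start the server without going through a shell.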

Wait for "Application startup complete":

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health   # → 200 once the server is ready
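Model loading can take a while, so client scripts may prefer to poll the health endpoint instead of checking once. A minimal standard-library sketch follows; `wait_for_server` and its timeout defaults are illustrative choices.

```python
import time
import urllib.request


def wait_for_server(url: str = "http://localhost:8080/health",
                    timeout_s: float = 300.0,
                    interval_s: float = 2.0) -> bool:
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
            pass
        time.sleep(interval_s)
    return False


# Typical use before sending the first request:
#   if not wait_for_server():
#       raise SystemExit("vLLM server did not become healthy in time")
```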

Text (Python, openai SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Vision: image understanding (Python)

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

curl:

IMAGE_B64=$(base64 -w0 photo.jpg)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/jpeg;base64,${IMAGE_B64}\"}},
        {\"type\": \"text\", \"text\": \"What is in this image?\"}
      ]
    }],
    \"max_tokens\": 300
  }"
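The examples above hard-code image/jpeg in the data URL. For other image types the MIME type should match the file, which a small helper can handle. This is a sketch; `to_data_url` and `image_part` are illustrative names, and the MIME type is guessed from the file extension only.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: str) -> str:
    """Encode an image file as a data URL with a MIME type guessed from its extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path: str) -> dict:
    """Build the image_url content part used in the chat requests above."""
    return {"type": "image_url", "image_url": {"url": to_data_url(path)}}
```

`image_part("photo.png")` can then replace the hand-written image_url dictionary in the `content` list.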

Audio: speech transcription / understanding (Python)

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

phi4-mm uses a custom conformer-based audio encoder (24 conformer blocks) with a rank-320 speech LoRA applied to the language decoder, so no separate ASR model is needed. Supported formats: wav, mp3, ogg, flac.

curl:

AUDIO_B64=$(base64 -w0 audio.wav)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"phi4-mm\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"input_audio\", \"input_audio\": {\"data\": \"${AUDIO_B64}\", \"format\": \"wav\"}},
        {\"type\": \"text\", \"text\": \"Transcribe and summarise.\"}
      ]
    }],
    \"max_tokens\": 512
  }"

Combined: image + audio in one prompt

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
with open("question.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="phi4-mm",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",  "image_url":  {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": "Answer the spoken question about the image."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

Tool calling

phi4-mm emits tool calls as functools[{"name":"...","arguments":{...}}]. The --tool-call-parser phi4_mini_json flag (vLLM 0.7+) handles this automatically. For a complete chat template that injects tools into phi4-mm's native <|tool|>...<|/tool|> block, see deploy/wsl-vllm/phi4-mm-tool-template.jinja in the companion repo.
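To make the raw format concrete, the sketch below parses a functools[...] string by hand. It is an illustration of the output shape only, not vLLM's parser: `parse_functools` is a hypothetical helper, and it assumes the whole response consists of the prefix followed by valid JSON.

```python
import json

PREFIX = "functools"


def parse_functools(text: str) -> list:
    """Extract tool calls from a raw 'functools[...]' model response.

    Returns an empty list when the text is an ordinary answer.
    """
    text = text.strip()
    if not text.startswith(PREFIX):
        return []
    calls = json.loads(text[len(PREFIX):].strip())
    if isinstance(calls, dict):
        # Tolerate a single call emitted without the surrounding list.
        calls = [calls]
    return [{"name": c["name"], "arguments": c.get("arguments", {})} for c in calls]


raw = 'functools[{"name": "get_weather", "arguments": {"city": "Paris"}}]'
for call in parse_functools(raw):
    print(call["name"], call["arguments"])
```

With the server-side parser enabled, clients should not need this; the calls arrive in the `tool_calls` field of the response message instead.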


Load locally with Transformers

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model_id = "Swicked86/phi4-mm-gptq"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

Requires pip install llmcompressor (or pip install compressed-tensors) to load the quantization_config from the checkpoint.


Quality

Inference Tests

All tests run via vLLM on RTX 5090, --gpu-memory-utilization 0.45.

Text, factual recall (< 0.5s):

Prompt: "What is the capital of France?"
Response: "The capital of France is Paris." ✅

Text, math reasoning (0.748s):

Prompt: "Solve step by step: If a train travels 120 miles in 2 hours, what is its speed in km/h?"
Response: step-by-step solution → 96.54 km/h ✅ (consistent with a conversion factor rounded to 1.609; 60 mph × 1.609344 gives 96.56 km/h)

Text, code generation (1.068s):

Prompt: "Write a Python function that checks if a string is a palindrome."
Response: correct is_palindrome() with docstring + example calls ✅

Vision, real image (3000×4000 JPEG):

Prompt: "Describe what you see in this image in detail."
Response: correctly identified anime figure on a TV screen, described the room, entertainment setup, and animation style ✅

Audio, real voice message (Discord OGG Opus, converted to 16kHz WAV):

Input: Discord voice message (~11s) discussing software development
Response: "The speaker is describing the process of transforming the rag function into a function that uses a local database rather than writing to and from files." ✅

Audio note: Discord voice messages are OGG Opus at 48 kHz. For best results, convert them to 16 kHz mono WAV before sending, and pass "format": "wav" in the request.
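One way to do that conversion is with ffmpeg, followed by a check of the result using Python's standard `wave` module. This sketch assumes ffmpeg is installed and on PATH (the card does not say which tool was used); the function names are illustrative.

```python
import subprocess
import wave


def ffmpeg_16k_mono_args(src: str, dst: str) -> list:
    """ffmpeg argv that resamples `src` to 16 kHz mono and writes `dst`."""
    # -y overwrite output, -ar sample rate, -ac channel count
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]


def convert_to_16k_mono(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_16k_mono_args(src, dst), check=True)


def is_16k_mono_wav(path: str) -> bool:
    """True if `path` is a WAV file at 16 kHz with a single channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1


# Typical use:
#   convert_to_16k_mono("voice-message.ogg", "audio.wav")
#   assert is_16k_mono_wav("audio.wav")
```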

vLLM LoRA note: vLLM currently only applies LoRA to the language model layers. Vision encoder LoRA layers (SigLIP) are silently skipped; this is a vLLM limitation. The speech LoRA (language decoder, rank-320) loaded and applied correctly.


Perplexity (wikitext-2-raw, context 512)

| Model | PPL | vs bf16 |
|---|---|---|
| bf16 (baseline) | 14.9338 ± 0.107 | - |
| W4A16 GPTQ (this model) | pending | pending |

Benchmark will be added after upload.
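For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that standard definition on toy values; it does not reproduce the evaluation tool or windowing behind the baseline number, which this card does not specify, and `perplexity` is an illustrative helper.

```python
import math


def perplexity(token_logprobs: list) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy check: if every token gets probability 1/2, the perplexity is 2.
print(perplexity([math.log(0.5)] * 4))
```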


Quantization Details

| Item | Value |
|---|---|
| Quantizer | llm-compressor (GPTQModifier) |
| Scheme | W4A16 (int4 weights, bfloat16 activations) |
| Group size | 128 |
| Sequential targets | Phi3DecoderLayer (32 × Phi3 text transformer blocks) |
| Excluded (kept bf16) | lm_head, model.embed_tokens_extend.* (covers SigLIP-400M vision encoder + conformer-based audio encoder) |
| Calibration | 512 samples, wikitext-2 |
| Source model | ~/phi4-mm-hf (safetensors, downloaded from HF) |
| Hardware | RTX 5090 32 GB, CUDA 12.0, WSL2 Ubuntu 24.04 |
| Script | scripts/quantize_phi4mm.py |
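To illustrate what "int4 weights, group size 128" means for storage, here is a plain round-to-nearest sketch of symmetric group quantization. It is deliberately not GPTQ: GPTQ additionally adjusts the remaining weights to compensate for rounding error using calibration data. The symmetric range, the 16-bit scale per group, and all function names are assumptions made for illustration; the checkpoint's actual packing may differ.

```python
def quantize_group(weights: list) -> tuple:
    """Map one group of floats to int4 codes in [-8, 7] plus a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: list, scale: float) -> list:
    """Recover approximate float weights from int4 codes and the group scale."""
    return [c * scale for c in codes]


def quantize_row(row: list, group_size: int = 128) -> list:
    """Quantize a weight row group by group; each group gets its own scale."""
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


def bits_per_weight(group_size: int = 128, weight_bits: int = 4,
                    scale_bits: int = 16) -> float:
    """Average storage per weight once the per-group scale is amortized."""
    return weight_bits + scale_bits / group_size


codes, scale = quantize_group([7.0, -3.0, 0.0, 1.0])
print(codes, scale, dequantize_group(codes, scale))
print(bits_per_weight(), "bits per weight at group size 128")
```

Smaller groups track local weight ranges more closely at the cost of more scales; at group size 128 the scale overhead is 0.125 bits per weight under these assumptions.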

Architecture

| Property | Value |
|---|---|
| Base model | Phi-4-Mini (3.8 B LLM backbone) |
| Total parameters | ~5.6 B |
| Context length | 128 K tokens (131,072) |
| Modalities | Text, Vision (SigLIP-400M), Audio/Speech (Conformer + rank-320 LoRA) |
| Text decoder | 32 × Phi3DecoderLayer, quantized to int4 |
| Vision encoder | SigLIP-400M (embed_tokens_extend.image_embed), bf16 |
| Audio encoder | Conformer-based audio encoder (24-block, 460M) + speech LoRA rank-320, bf16 |
