Friday-VLM

Friday-VLM is a fine-tune of the text-only Phi-4-mini-reasoning model that adds multimodal (image + text) instruction following. The architecture and config code live in this repo, so callers must load the model with `trust_remote_code=True`.

Model variants

Repo ID	Precision	File format	Typical VRAM*	Size on disk
`kevin510/friday`	bf16 (full)	`safetensors`	100 %	100 %
`kevin510/friday-fp4`	fp4 (bitsandbytes int4)	`safetensors`	≈ 30 %	≈ 25 %
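
To make the relative figures in the table concrete, the sketch below scales a measured bf16 baseline by the table's ratios. `VARIANTS` and `estimate_footprint` are illustrative names, not part of the repo, and the baseline numbers in the usage line are hypothetical placeholders rather than measurements.

```python
# Relative footprints copied from the variants table (bf16 = 1.0).
# The fp4 ratios are approximate ("≈" in the table).
VARIANTS = {
    "kevin510/friday": {"vram": 1.00, "disk": 1.00},
    "kevin510/friday-fp4": {"vram": 0.30, "disk": 0.25},
}


def estimate_footprint(repo_id: str, bf16_vram_gb: float, bf16_disk_gb: float) -> dict:
    """Scale a measured bf16 baseline by the table's ratio for `repo_id`."""
    ratio = VARIANTS[repo_id]
    return {
        "vram_gb": bf16_vram_gb * ratio["vram"],
        "disk_gb": bf16_disk_gb * ratio["disk"],
    }


# Hypothetical baseline: replace 10.0 and 8.0 with values measured on your hardware.
print(estimate_footprint("kevin510/friday-fp4", bf16_vram_gb=10.0, bf16_disk_gb=8.0))
```

Because the table only gives approximate ratios, the result is a rough planning figure, not a guarantee.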

Dependencies

conda create --name friday python=3.12 -y
conda activate friday
pip install transformers torch torchvision deepspeed accelerate pillow einops timm
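
After installing, a quick import check can confirm the environment is complete. This is a minimal sketch using only the standard library. `missing_modules` is a hypothetical helper, and the module names are the assumed import names for the packages above (note that `pillow` is imported as `PIL`).

```python
from importlib.util import find_spec

# Import names assumed for the pip packages listed above (pillow -> PIL).
REQUIRED_MODULES = [
    "transformers", "torch", "torchvision", "deepspeed",
    "accelerate", "PIL", "einops", "timm",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```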

Quick start

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

# The custom architecture ships with the repo, hence trust_remote_code=True.
tok = AutoTokenizer.from_pretrained("kevin510/friday", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday",
    trust_remote_code=True,
    device_map="auto",
)
model.eval()

# <image> is the placeholder for the image inside the chat prompt.
prompt = "Describe this image."
user_prompt = f"<|user|><image>\n{prompt}\n<|assistant|>"
inputs = tok(user_prompt, return_tensors="pt").to(model.device)

image = Image.open("my_image.jpg").convert("RGB")

# Greedy decoding; the image is passed through the `images` argument.
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        images=[image],
    )

# Special tokens are kept so the chat markers stay visible in the output.
print(tok.decode(out[0], skip_special_tokens=False))
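
The prompt template and the decoding step above can be wrapped in two small helpers. This is a sketch: `build_prompt` and `extract_reply` are hypothetical names, and stripping a trailing `<|end|>` marker assumes the tokenizer uses that end-of-turn token, which the quick start does not confirm.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the chat template used in the quick start."""
    return f"<|user|><image>\n{question}\n<|assistant|>"


def extract_reply(decoded: str, end_marker: str = "<|end|>") -> str:
    """Return the text after the last <|assistant|> marker.

    `end_marker` is an assumption about the end-of-turn token; pass a
    different string if the tokenizer uses another one.
    """
    # rpartition yields the whole string when the marker is absent.
    reply = decoded.rpartition("<|assistant|>")[2]
    # Drop everything from the first end marker onwards, if present.
    reply = reply.split(end_marker, 1)[0]
    return reply.strip()


prompt = build_prompt("Describe this image.")
# Hypothetical decoded output, for illustration only.
sample = prompt + "A cat sitting on a sofa.<|end|>"
print(extract_reply(sample))
```

In the quick start, `build_prompt(prompt)` would replace the hand-written f-string, and `extract_reply` would be applied to the string returned by `tok.decode`.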

Architecture at a glance

FastViT-HD ─▶ 3072-d patch embeddings ─▶ S2 6144-d patch embeddings ─▶ 2-layer MLP vision-adapter (6144 → 3072)

(vision tokens, 3072 d) ─┐
                         ├─▶ Phi-4-mini-reasoning (3.8 B params, hidden = 3072)
<text tokens, 3072 d> ───┘     │
                               │ (standard self-attention only;
                               │  language tower is frozen at finetune)
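
The data flow in the diagram can be traced as tensor shapes. The sketch below is pure bookkeeping, not the repo's implementation. `trace_shapes` and `adapter_param_count` are hypothetical helpers. It assumes that S2 keeps the patch count while doubling the channel width (read off the diagram), that the adapter's hidden width is 3072 with biases, and that the `<image>` placeholder token itself can be ignored when counting sequence length.

```python
VISION_DIM = 3072   # FastViT-HD patch embedding width (from the diagram)
S2_DIM = 6144       # width after S2 (from the diagram)
LM_HIDDEN = 3072    # language model hidden size (from the diagram)


def trace_shapes(num_patches: int, num_text_tokens: int) -> dict:
    """Return the (tokens, width) shape at each stage of the pipeline."""
    return {
        "patch_embeddings": (num_patches, VISION_DIM),
        "s2_embeddings": (num_patches, S2_DIM),        # assumed: same patch count
        "adapter_output": (num_patches, LM_HIDDEN),    # MLP maps 6144 -> 3072
        "text_embeddings": (num_text_tokens, LM_HIDDEN),
        # Vision and text tokens share one sequence fed to the language model.
        "lm_input": (num_patches + num_text_tokens, LM_HIDDEN),
    }


def adapter_param_count(d_in: int = S2_DIM, d_hidden: int = LM_HIDDEN,
                        d_out: int = LM_HIDDEN) -> int:
    """Parameters of a 2-layer MLP with biases (hidden width is assumed)."""
    return (d_in * d_hidden + d_hidden) + (d_hidden * d_out + d_out)


# Hypothetical sizes: 256 image patches and 12 text tokens.
for stage, shape in trace_shapes(256, 12).items():
    print(f"{stage:>16}: {shape}")
print("adapter parameters:", adapter_param_count())
```

Since the adapter output already matches the language model's hidden size, vision tokens can sit in the same sequence as text embeddings without any change to the attention layers, which is consistent with the "standard self-attention only" note.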

Limitations & Responsible AI

Friday-VLM may hallucinate objects, invent facts, or reproduce societal biases. All variants share the same behaviour profile; quantisation does not filter or sanitise model outputs. Users must apply their own content-safety layer before deployment.

Citation

@misc{friday2025,
  title   = {Friday VLM: Efficient Instruction-Tuned Vision–Language Modelling},
  author  = {Kevin Rohling},
  year    = {2025},
  url     = {https://huggingface.co/kevin510/friday}
}
