Instructions to use microsoft/Florence-2-large with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/Florence-2-large with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="microsoft/Florence-2-large", trust_remote_code=True)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
```
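The snippet above only loads the checkpoint. A minimal sketch of actually running a task, assuming the `processor` and `model` objects from the snippet are in scope (the image URL and the `<CAPTION>` task prompt here are illustrative, not part of the snippet):

```python
# Sketch: run Florence-2's <CAPTION> task with the processor/model loaded above.
# The image URL is illustrative; substitute your own image.
import requests
from PIL import Image

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw).convert("RGB")
prompt = "<CAPTION>"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# post_process_generation parses the raw output into a {task: result} dict
result = processor.post_process_generation(raw, task=prompt, image_size=(image.width, image.height))
print(result[prompt])
```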
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use microsoft/Florence-2-large with vLLM:
Install from pip and serve the model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "microsoft/Florence-2-large"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Florence-2-large",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
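Because the server exposes an OpenAI-compatible API, the curl call above can equally be issued from Python. A minimal sketch using the `openai` client, assuming the server from the snippet is running on localhost:8000:

```python
# Sketch: query the vLLM server via its OpenAI-compatible completions endpoint.
# Assumes `vllm serve "microsoft/Florence-2-large"` is running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
completion = client.completions.create(
    model="microsoft/Florence-2-large",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```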
- SGLang
How to use microsoft/Florence-2-large with SGLang:
Install from pip and serve the model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "microsoft/Florence-2-large" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Florence-2-large",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "microsoft/Florence-2-large" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Florence-2-large",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
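The same OpenAI-compatible completions endpoint can also be called from Python; a minimal sketch with `requests`, assuming one of the servers above is listening on localhost:30000:

```python
# Sketch: call the SGLang server's OpenAI-compatible completions endpoint.
# Assumes the server launched above is listening on localhost:30000.
import requests

resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "microsoft/Florence-2-large",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```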
- Docker Model Runner
How to use microsoft/Florence-2-large with Docker Model Runner:
```sh
docker model run hf.co/microsoft/Florence-2-large
```
Inquiry regarding performance alignment for Florence-2 on COCO dataset
Dear Team Florence-2,
I am a graduate student currently using Florence-2 as a backbone for my vision-language research. I am writing to seek your guidance regarding a performance discrepancy I encountered while reproducing the COCO captioning results.
According to the paper, the zero-shot CIDEr for Florence-2-base is 133.0. However, my local evaluation on the Karpathy test split yields 103.48, and even the fine-tuned version (Florence-2-base-ft) only reaches 111.45.
I have attached my evaluation script (eval_florence2_baseline_hf_datasets.py) for your reference. To summarize my setup:
```python
import torch
from PIL import Image
from tqdm import tqdm
from typing import Dict, List, Set
from transformers import AutoModelForCausalLM, AutoProcessor

# ids, meta, args, coco, train_ids, image_path, gts, and cider_bleu are
# defined elsewhere in the attached script.
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", torch_dtype=torch.float16, trust_remote_code=True).to(dev).eval()
proc = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

task = "<CAPTION>"
preds: List[Dict] = []
done: Set[int] = set()
for s in tqdm(range(0, len(ids), args.batch_size), desc=task):
    bid = ids[s : s + args.batch_size]
    imgs, wh = [], []
    for i in bid:
        inf = meta[i]
        ip = image_path(coco, i, inf, train_ids)
        imgs.append(Image.open(ip).convert("RGB"))
        wh.append((int(inf["width"]), int(inf["height"])))
    batch = proc(text=[task] * len(imgs), images=imgs, return_tensors="pt", padding=True).to(dev)
    batch["pixel_values"] = batch["pixel_values"].to(torch.float16)
    with torch.no_grad():
        out = model.generate(
            input_ids=batch["input_ids"],
            pixel_values=batch["pixel_values"],
            attention_mask=batch.get("attention_mask"),
            max_new_tokens=1024,
            num_beams=3,
            do_sample=False,
            early_stopping=True,
        )
    texts = proc.tokenizer.batch_decode(out, skip_special_tokens=True)
    for i, (w, h), raw in zip(bid, wh, texts):
        cap = proc.post_process_generation(raw, task=task, image_size=(w, h)).get(task, "").strip()
        cap = cap.replace("<s>", "").replace("</s>", "").replace("<pad>", "").strip()
        if i not in done:
            preds.append({"image_id": i, "caption": cap})
            done.add(i)

cider, bleu = cider_bleu(preds, gts)
print(f"CIDEr: {cider:.2f}")
print(f"BLEU: {bleu[0]:.4f} / {bleu[1]:.4f} / {bleu[2]:.4f} / {bleu[3]:.4f}")
```
Task Prompt: `<CAPTION>`
Environment: transformers==4.46.3, torch.float16, latest model revision.
Generation Config: num_beams=3, do_sample=False, early_stopping=True, max_new_tokens=1024.
Post-processing: I am using processor.post_process_generation followed by manual cleaning of special tokens (e.g., `<s>`, `</s>`, `<pad>`).
Despite following the standard evaluation pipeline, there remains a significant gap (~30 CIDEr points) from the reported baseline. Could you kindly share the specific generation configurations (e.g., beam size, length penalty) or any data preprocessing/prompting details used for the official benchmark?
Thank you for your time and for sharing this impressive model with the community. I look forward to your insights.
Best regards,