Instructions to use microsoft/Florence-2-large with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use microsoft/Florence-2-large with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="microsoft/Florence-2-large", trust_remote_code=True)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
```
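The snippet above only loads the checkpoint. A minimal sketch of actually running a task, assuming the `processor` and `model` objects from the snippet are in scope (the image URL and the `<CAPTION>` task prompt here are illustrative, not part of the snippet):

```python
# Sketch: run Florence-2's <CAPTION> task with the processor/model loaded above.
# The image URL is illustrative; substitute your own image.
import requests
from PIL import Image

image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw).convert("RGB")
prompt = "<CAPTION>"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# post_process_generation parses the raw output into a {task: result} dict
result = processor.post_process_generation(raw, task=prompt, image_size=(image.width, image.height))
print(result[prompt])
```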
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use microsoft/Florence-2-large with vLLM:
Install from pip and serve the model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "microsoft/Florence-2-large"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Florence-2-large",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
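Because the server exposes an OpenAI-compatible API, the curl call above can equally be issued from Python. A minimal sketch using the `openai` client, assuming the server from the snippet is running on localhost:8000:

```python
# Sketch: query the vLLM server via its OpenAI-compatible completions endpoint.
# Assumes `vllm serve "microsoft/Florence-2-large"` is running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
completion = client.completions.create(
    model="microsoft/Florence-2-large",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```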
- SGLang
How to use microsoft/Florence-2-large with SGLang:
Install from pip and serve the model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "microsoft/Florence-2-large" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Florence-2-large",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "microsoft/Florence-2-large" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/Florence-2-large",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
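The same OpenAI-compatible completions endpoint can also be called from Python; a minimal sketch with `requests`, assuming one of the servers above is listening on localhost:30000:

```python
# Sketch: call the SGLang server's OpenAI-compatible completions endpoint.
# Assumes the server launched above is listening on localhost:30000.
import requests

resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "microsoft/Florence-2-large",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```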
- Docker Model Runner
How to use microsoft/Florence-2-large with Docker Model Runner:
```sh
docker model run hf.co/microsoft/Florence-2-large
```
Inquiry regarding performance alignment for Florence-2 on COCO dataset
Dear Team Florence-2,
I am a graduate student currently using Florence-2 as a backbone for my vision-language research. I am writing to seek your guidance regarding a performance discrepancy I encountered while reproducing the COCO captioning results.
According to the paper, the zero-shot CIDEr for Florence-2-base is 133.0. However, my local evaluation on the Karpathy test split yields 103.48, and even the fine-tuned version (Florence-2-base-ft) only reaches 111.45.
I have attached my evaluation script (eval_florence2_baseline_hf_datasets.py) for your reference. To summarize my setup:
```python
import torch
from PIL import Image
from tqdm import tqdm
from typing import Dict, List, Set
from transformers import AutoModelForCausalLM, AutoProcessor

# ids, meta, args, coco, train_ids, image_path, gts, and cider_bleu are
# defined elsewhere in the attached script.
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base", torch_dtype=torch.float16, trust_remote_code=True).to(dev).eval()
proc = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)

task = "<CAPTION>"
preds: List[Dict] = []
done: Set[int] = set()
for s in tqdm(range(0, len(ids), args.batch_size), desc=task):
    bid = ids[s : s + args.batch_size]
    imgs, wh = [], []
    for i in bid:
        inf = meta[i]
        ip = image_path(coco, i, inf, train_ids)
        imgs.append(Image.open(ip).convert("RGB"))
        wh.append((int(inf["width"]), int(inf["height"])))
    batch = proc(text=[task] * len(imgs), images=imgs, return_tensors="pt", padding=True).to(dev)
    batch["pixel_values"] = batch["pixel_values"].to(torch.float16)
    with torch.no_grad():
        out = model.generate(
            input_ids=batch["input_ids"],
            pixel_values=batch["pixel_values"],
            attention_mask=batch.get("attention_mask"),
            max_new_tokens=1024,
            num_beams=3,
            do_sample=False,
            early_stopping=True,
        )
    texts = proc.tokenizer.batch_decode(out, skip_special_tokens=True)
    for i, (w, h), raw in zip(bid, wh, texts):
        cap = proc.post_process_generation(raw, task=task, image_size=(w, h)).get(task, "").strip()
        cap = cap.replace("<s>", "").replace("</s>", "").replace("<pad>", "").strip()
        if i not in done:
            preds.append({"image_id": i, "caption": cap})
            done.add(i)

cider, bleu = cider_bleu(preds, gts)
print(f"CIDEr: {cider:.2f}")
print(f"BLEU: {bleu[0]:.4f} / {bleu[1]:.4f} / {bleu[2]:.4f} / {bleu[3]:.4f}")
```
Task Prompt: `<CAPTION>`
Environment: transformers==4.46.3, torch.float16, latest model revision.
Generation Config: num_beams=3, do_sample=False, early_stopping=True, max_new_tokens=1024.
Post-processing: I am using processor.post_process_generation followed by manual cleaning of special tokens (e.g., `<s>`, `</s>`, `<pad>`).
Despite following the standard evaluation pipeline, there remains a significant gap (~30 CIDEr points) from the reported baseline. Could you kindly share the specific generation configurations (e.g., beam size, length penalty) or any data preprocessing/prompting details used for the official benchmark?
Thank you for your time and for sharing this impressive model with the community. I look forward to your insights.
Best regards,