Instructions for using RL-MIND/UHR-BAT with Transformers, vLLM, SGLang, and Docker Model Runner, followed by the model card.

Libraries

How to use RL-MIND/UHR-BAT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="RL-MIND/UHR-BAT", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("RL-MIND/UHR-BAT", trust_remote_code=True, dtype="auto")

vLLM

How to use RL-MIND/UHR-BAT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RL-MIND/UHR-BAT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RL-MIND/UHR-BAT",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
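
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and the response layout (`choices[0].message.content`) is assumed to follow the usual OpenAI-compatible chat completion schema.

```python
import json
import urllib.request


def build_chat_request(base_url, model, question, image_url):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(request, timeout=120):
    """Send the request and return the assistant's reply text.

    Assumes the server answers with the OpenAI chat completion layout.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `ask(build_chat_request("http://localhost:8000", "RL-MIND/UHR-BAT", "Describe this image in one sentence.", image_url))` should return the reply text. Pointing `base_url` at `http://localhost:30000` targets an SGLang server instead.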

Use Docker

docker model run hf.co/RL-MIND/UHR-BAT

SGLang

How to use RL-MIND/UHR-BAT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RL-MIND/UHR-BAT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RL-MIND/UHR-BAT",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RL-MIND/UHR-BAT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RL-MIND/UHR-BAT",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
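
Remote-sensing imagery usually lives on local disk rather than behind a public URL. OpenAI-compatible servers generally accept a base64 data URL in the `image_url` field, so a local file can be inlined into the request. The sketch below assumes that behaviour for the servers above; the helper names `image_to_data_url` and `image_content_part` are hypothetical. Very large images produce very large request bodies, so server-side size limits may apply.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    path = Path(path)
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_content_part(path):
    """Build the image entry for the "content" list of a chat message."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

The returned dictionary can replace the `image_url` entry in the JSON bodies shown above.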

Docker Model Runner
How to use RL-MIND/UHR-BAT with Docker Model Runner:
```
docker model run hf.co/RL-MIND/UHR-BAT
```

UHR-BAT

UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

ICML 2026

UHR-BAT is a budget-aware vision-language framework for ultra-high-resolution remote sensing imagery. It targets the setting where kilometer-scale scenes contain query-critical evidence that may occupy only a few pixels. Instead of relying on direct downsampling, dense tiling, or generic global pruning, UHR-BAT uses query-guided multi-scale token selection and region-faithful compression to preserve small decisive evidence under a strict context budget.

Highlights

Query-guided token compression: visual token budgets are allocated according to the current instruction, helping preserve small but decisive evidence.
Multi-scale input: the model encodes remote-sensing images at multiple target scales to retain both global context and fine-grained local details.
Region-faithful preserve and merge: informative regional tokens are kept, while redundant background tokens are merged into compact representatives.
Efficient UHR understanding: the method targets answer quality under memory and latency constraints, not just raw benchmark accuracy.
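
The query-guided selection and preserve-and-merge ideas listed above can be illustrated with a toy sketch. This is not the UHR-BAT implementation: the cosine-similarity scoring, the single merged background token, and the names `cosine` and `compress_tokens` are all assumptions chosen for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress_tokens(tokens, query, budget):
    """Reduce `tokens` to at most `budget` vectors.

    The `budget - 1` tokens most similar to `query` are preserved in their
    original order; all remaining tokens are averaged into one representative
    that is appended at the end.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    if len(tokens) <= budget:
        return [list(token) for token in tokens]

    scores = [cosine(token, query) for token in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[: budget - 1])   # preserved, in original order
    rest = ranked[budget - 1:]            # background tokens to merge

    dim = len(query)
    merged = [sum(tokens[i][d] for i in rest) / len(rest) for d in range(dim)]
    return [list(tokens[i]) for i in keep] + [merged]


tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 2.0]]
print(compress_tokens(tokens, query=[1.0, 0.0], budget=3))
# [[1.0, 0.0], [0.9, 0.1], [0.0, 1.5]]
```

In this toy case the two tokens aligned with the query survive unchanged, while the two unrelated tokens collapse into their mean, so the output fits the budget of three.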

Main Results

The project page reports the following ultra-high-resolution remote-sensing results under strict token budgets:

XLRS-Bench: 44.0 weighted average accuracy.
MMERealworld-RS: 33.33 mean score.
RSHR-Bench: 29.2 on Perception and 45.0 on Reasoning.

Model Details

This checkpoint contains the full multimodal UHR-BAT model:

Qwen2/LongVA language backbone
CLIP ViT-L/14-336 vision tower
multimodal projector
multiscale token MLP
scale positional residual weights
Hugging Face remote-code wrappers for direct loading

The model repository includes configuration_uhr_bat.py and modeling_uhr_bat.py, so trust_remote_code=True is required when loading the full architecture.

Quick Start

import importlib
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForCausalLM, AutoTokenizer

model_id = "RL-MIND/UHR-BAT"
image_path = "your_remote_sensing_image.jpg"
question = "Describe this remote-sensing image."

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
image_processor = AutoImageProcessor.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
).eval()

# Reuse the preprocessing helpers shipped with the model's remote code.
uhrbat = importlib.import_module(model.__class__.__module__)
image = Image.open(image_path).convert("RGB")

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n<image>\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
image_token_id = getattr(model.config, "image_token_index", -200)
input_ids = uhrbat.tokenizer_image_token(
    prompt,
    tokenizer,
    image_token_id,
    return_tensors="pt",
).unsqueeze(0).to(model.device)
attention_mask = torch.ones_like(input_ids)

target_sizes = [672, 1344, 2688, 4032]
multiscale_pixels = [
    uhrbat.split_image_to_multiscale_tiles(
        image,
        image_processor,
        target_sizes=target_sizes,
        tile_size=336,
    )
]

with torch.inference_mode():
    output = model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        image_sizes=[image.size],
        modalities=["image"],
        multiscale_pixels=multiscale_pixels,
        multiscale_masks=[{}],
        multiscale_topk=[80, 320, 600, 2000],
        multiscale_target_sizes=target_sizes,
        do_sample=False,
        max_new_tokens=256,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

prompt_len = output.sequences.shape[1] - len(output.scores)
answer_ids = output.sequences[:, prompt_len:].clone()
answer_ids[answer_ids < 0] = tokenizer.pad_token_id or tokenizer.eos_token_id
answer = tokenizer.decode(answer_ids[0], skip_special_tokens=True).strip()
print(answer)
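
The Quick Start settings can be put in perspective with some rough arithmetic. The sketch below rests on assumptions the model card does not state: that each image is resized to a square of the target size and cut into 336-pixel tiles, that every tile yields 24 x 24 = 576 patch tokens from the CLIP ViT-L/14-336 tower (class token ignored), and that `multiscale_topk` is the number of tokens kept per scale.

```python
TILE_SIZE = 336   # tile_size used in the Quick Start
PATCH_SIZE = 14   # patch size of CLIP ViT-L/14


def patch_tokens_per_tile(tile_size=TILE_SIZE, patch_size=PATCH_SIZE):
    """Patch tokens produced by one square tile (class token not counted)."""
    side = tile_size // patch_size
    return side * side


def dense_token_count(target_size, tile_size=TILE_SIZE):
    """Patch tokens for a square image of `target_size` pixels, fully tiled."""
    tiles_per_side = target_size // tile_size
    return tiles_per_side ** 2 * patch_tokens_per_tile(tile_size)


target_sizes = [672, 1344, 2688, 4032]
topk_per_scale = [80, 320, 600, 2000]   # assumed: tokens kept per scale

dense_per_scale = [dense_token_count(size) for size in target_sizes]
total_dense = sum(dense_per_scale)
total_kept = sum(topk_per_scale)

for size, dense, kept in zip(target_sizes, dense_per_scale, topk_per_scale):
    print(f"{size:>5}px: {dense:>6} dense tokens -> {kept:>4} kept")
print(f"total: {total_dense} -> {total_kept} ({total_dense / total_kept:.1f}x fewer)")
# total: 131328 -> 3000 (43.8x fewer)
```

Under these assumptions, dense tiling at all four scales would cost about 131k visual tokens, while the Quick Start budgets keep 3,000.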

Citation

@inproceedings{dang2026uhrbat,
  title={UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing},
  author={Dang, Yunkai and Dai, Minxin and Yang, Yuekun and Li, Zhangnan and Li, Wenbin and Miao, Feng and Gao, Yang},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}

Safetensors

Model size: 8B params
Tensor type: F32, BF16

Paper for RL-MIND/UHR-BAT

UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing (paper 2604.13565, published Apr 15)