UHR-BAT
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
ICML 2026
Project Page | Paper | Code
UHR-BAT is a budget-aware vision-language framework for ultra-high-resolution remote sensing imagery. It targets the setting where kilometer-scale scenes contain query-critical evidence that may occupy only a few pixels. Instead of relying on direct downsampling, dense tiling, or generic global pruning, UHR-BAT uses query-guided multi-scale token selection and region-faithful compression to preserve small decisive evidence under a strict context budget.
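To make the cost of direct downsampling concrete, the arithmetic below estimates how large a small object remains after a whole scene is resized to a single encoder input. The 10,000 px image width and 8 px object width are hypothetical values chosen for illustration; the 336 px input side is the resolution of the CLIP ViT-L/14-336 vision tower listed under Model Details.

```python
def downsampled_extent(object_px: float, image_px: int, input_px: int = 336) -> float:
    """Width (in pixels) an object keeps after the whole image is resized
    so that its width equals input_px."""
    return object_px * input_px / image_px


# Hypothetical scene: a 10,000 px wide image containing an 8 px wide object.
extent = downsampled_extent(object_px=8, image_px=10_000)
print(f"{extent:.3f} px")  # 0.269 px
```

In this illustrative case the object shrinks to roughly a quarter of a pixel, which is the failure mode that motivates selecting tokens at multiple scales instead of downsampling once.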
Highlights
- Query-guided token compression: visual token budgets are allocated according to the current instruction, helping preserve small but decisive evidence.
- Multi-scale input: the model encodes remote-sensing images at multiple target scales to retain both global context and fine-grained local details.
- Region-faithful preserve and merge: informative regional tokens are kept, while redundant background tokens are merged into compact representatives.
- Efficient UHR understanding: the method targets answer quality under memory and latency constraints rather than raw benchmark accuracy alone.
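The preserve-and-merge idea in the highlights can be illustrated with a toy sketch. This is not the UHR-BAT implementation: cosine similarity to a query vector and a plain mean over the discarded tokens are stand-ins chosen for illustration, and the function name `preserve_and_merge` and its `budget` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def preserve_and_merge(tokens, query, budget):
    """Keep the (budget - 1) tokens most similar to the query and merge the
    remaining tokens into one mean token, so at most `budget` tokens remain."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    if len(tokens) <= budget:
        return list(tokens)
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: cosine(tokens[i], query),
        reverse=True,
    )
    keep = sorted(ranked[: budget - 1])  # restore original token order
    rest = ranked[budget - 1:]
    dim = len(tokens[0])
    merged = [sum(tokens[i][d] for i in rest) / len(rest) for d in range(dim)]
    return [tokens[i] for i in keep] + [merged]


# Toy 2-D "visual tokens" and a query that points along the first axis.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.8], [0.9, 0.1], [0.1, 1.0]]
query = [1.0, 0.0]
compressed = preserve_and_merge(tokens, query, budget=3)
print(len(compressed))  # 3
```

Here the two tokens aligned with the query survive unchanged, while the three background-like tokens collapse into a single representative, so the sequence fits the budget without dropping the background entirely.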
Main Results
The project page reports strong ultra-high-resolution remote-sensing results under strict token budgets:
- XLRS-Bench: 44.0 weighted average accuracy.
- MMERealworld-RS: 33.33 mean score.
- RSHR-Bench: 29.2 on Perception and 45.0 on Reasoning.
Model Details
This checkpoint contains the full multimodal UHR-BAT model:
- Qwen2/LongVA language backbone
- CLIP ViT-L/14-336 vision tower
- multimodal projector
- multiscale token MLP
- scale positional residual weights
- Hugging Face remote-code wrappers for direct loading
The model repository includes configuration_uhr_bat.py and modeling_uhr_bat.py, so trust_remote_code=True is required when loading the full architecture.
Quick Start
import importlib
import torch
from PIL import Image
from transformers import AutoImageProcessor
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "RL-MIND/UHR-BAT"
image_path = "your_remote_sensing_image.jpg"
question = "Describe this remote-sensing image."
tokenizer = AutoTokenizer.from_pretrained(model_id)
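if tokenizer.pad_token_id is None:
    # Fall back to the EOS token when no padding token is defined.
    tokenizer.pad_token = tokenizer.eos_token
image_processor = AutoImageProcessor.from_pretrained(model_id)
# trust_remote_code=True is needed to load the custom UHR-BAT architecture.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
).eval()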
# Reuse the preprocessing helpers shipped with the model's remote code.
uhrbat = importlib.import_module(model.__class__.__module__)
image = Image.open(image_path).convert("RGB")
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n<image>\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
image_token_id = getattr(model.config, "image_token_index", -200)
input_ids = uhrbat.tokenizer_image_token(
    prompt,
    tokenizer,
    image_token_id,
    return_tensors="pt",
).unsqueeze(0).to(model.device)
attention_mask = torch.ones_like(input_ids)
target_sizes = [672, 1344, 2688, 4032]
multiscale_pixels = [
    uhrbat.split_image_to_multiscale_tiles(
        image,
        image_processor,
        target_sizes=target_sizes,
        tile_size=336,
    )
]
with torch.inference_mode():
    output = model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        image_sizes=[image.size],
        modalities=["image"],
        multiscale_pixels=multiscale_pixels,
        multiscale_masks=[{}],
        multiscale_topk=[80, 320, 600, 2000],
        multiscale_target_sizes=target_sizes,
        do_sample=False,
        max_new_tokens=256,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
prompt_len = output.sequences.shape[1] - len(output.scores)
answer_ids = output.sequences[:, prompt_len:].clone()
answer_ids[answer_ids < 0] = tokenizer.pad_token_id or tokenizer.eos_token_id
answer = tokenizer.decode(answer_ids[0], skip_special_tokens=True).strip()
print(answer)
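To get a feel for the numbers in the Quick Start, the sketch below compares a dense token count with the `multiscale_topk` values. It rests on several assumptions that the model card does not confirm: each scale is resized to a square of the target size, tiles are non-overlapping 336 px squares, each tile yields 24 x 24 patch tokens (336 / 14 for a ViT-L/14 encoder, ignoring any class token), and each `multiscale_topk` entry is the number of tokens kept at the matching scale.

```python
TILE_SIZE = 336   # tile_size from the Quick Start
PATCH_SIZE = 14   # patch size of a ViT-L/14 encoder

target_sizes = [672, 1344, 2688, 4032]
multiscale_topk = [80, 320, 600, 2000]

# Assumed: one token per 14 x 14 patch, so 24 * 24 = 576 tokens per tile.
tokens_per_tile = (TILE_SIZE // PATCH_SIZE) ** 2


def dense_tokens(side: int) -> int:
    """Patch tokens for a square image of the given side, tiled without overlap."""
    tiles_per_side = side // TILE_SIZE
    return tiles_per_side ** 2 * tokens_per_tile


dense = [dense_tokens(side) for side in target_sizes]
for side, d, k in zip(target_sizes, dense, multiscale_topk):
    print(f"{side:>4} px: {d:>6} dense tokens, top-k {k}")
print(f"total: {sum(dense)} dense tokens, {sum(multiscale_topk)} in top-k")
```

Under these assumptions the four scales would produce 131,328 patch tokens before compression, against a combined top-k of 3,000, which is roughly a 44-fold reduction.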
Citation
@inproceedings{dang2026uhrbat,
  title={UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing},
  author={Dang, Yunkai and Dai, Minxin and Yang, Yuekun and Li, Zhangnan and Li, Wenbin and Miao, Feng and Gao, Yang},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}