UHR-BAT

UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

ICML 2026

Project Page | Paper | Code

UHR-BAT is a budget-aware vision-language framework for ultra-high-resolution remote sensing imagery. It targets the setting where kilometer-scale scenes contain query-critical evidence that may occupy only a few pixels. Instead of relying on direct downsampling, dense tiling, or generic global pruning, UHR-BAT uses query-guided multi-scale token selection and region-faithful compression to preserve small decisive evidence under a strict context budget.

Highlights

  • Query-guided token compression: visual token budgets are allocated according to the current instruction, helping preserve small but decisive evidence.
  • Multi-scale input: the model encodes remote-sensing images at multiple target scales to retain both global context and fine-grained local details.
  • Region-faithful preserve and merge: informative regional tokens are kept, while redundant background tokens are merged into compact representatives.
  • Efficient UHR understanding: the method is designed to keep answer quality under memory and latency constraints, rather than to maximize raw benchmark accuracy alone.
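The preserve-and-merge idea in the highlights can be illustrated with a small, dependency-free sketch. This is not the UHR-BAT implementation: the function names `cosine` and `preserve_and_merge`, the cosine-similarity scoring, and the single averaged background token are all assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def preserve_and_merge(tokens, query, budget):
    """Hypothetical helper: keep the (budget - 1) tokens most similar to
    the query and average all remaining tokens into one representative."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    if len(tokens) <= budget:
        return [list(t) for t in tokens]  # already within budget
    scores = [cosine(t, query) for t in tokens]
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[: budget - 1])  # restore original token order
    rest = order[budget - 1:]           # low-scoring "background" tokens
    dim = len(tokens[0])
    merged = [sum(tokens[i][d] for i in rest) / len(rest) for d in range(dim)]
    return [list(tokens[i]) for i in keep] + [merged]

# Toy example: four 2-D "visual tokens" and a query embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.9, 0.1]]
query = [1.0, 0.0]
compressed = preserve_and_merge(tokens, query, budget=3)
print(compressed)
```

In this toy case the two query-aligned tokens survive unchanged and the two off-query tokens collapse into their mean, so the output length equals the budget.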

Main Results

The project page reports the following ultra-high-resolution remote-sensing results under strict token budgets:

  • XLRS-Bench: 44.0 weighted average accuracy.
  • MMERealworld-RS: 33.33 mean score.
  • RSHR-Bench: 29.2 on Perception and 45.0 on Reasoning.

Model Details

This checkpoint contains the full multimodal UHR-BAT model:

  • Qwen2/LongVA language backbone
  • CLIP ViT-L/14-336 vision tower
  • multimodal projector
  • multiscale token MLP
  • scale positional residual weights
  • Hugging Face remote-code wrappers for direct loading

The model repository includes configuration_uhr_bat.py and modeling_uhr_bat.py, so trust_remote_code=True is required when loading the full architecture.

Quick Start

import importlib
import torch
from PIL import Image
from transformers import AutoImageProcessor
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FelixKAI/UHR-BAT"
image_path = "your_remote_sensing_image.jpg"
question = "Describe this remote-sensing image."

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
image_processor = AutoImageProcessor.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
).eval()

# Reuse the preprocessing helpers shipped with the model's remote code.
uhrbat = importlib.import_module(model.__class__.__module__)
image = Image.open(image_path).convert("RGB")

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n<image>\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
image_token_id = getattr(model.config, "image_token_index", -200)
input_ids = uhrbat.tokenizer_image_token(
    prompt,
    tokenizer,
    image_token_id,
    return_tensors="pt",
).unsqueeze(0).to(model.device)
attention_mask = torch.ones_like(input_ids)

target_sizes = [672, 1344, 2688, 4032]
multiscale_pixels = [
    uhrbat.split_image_to_multiscale_tiles(
        image,
        image_processor,
        target_sizes=target_sizes,
        tile_size=336,
    )
]

with torch.inference_mode():
    output = model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        image_sizes=[image.size],
        modalities=["image"],
        multiscale_pixels=multiscale_pixels,
        multiscale_masks=[{}],
        multiscale_topk=[80, 320, 600, 2000],
        multiscale_target_sizes=target_sizes,
        do_sample=False,
        max_new_tokens=256,
        return_dict_in_generate=True,
        output_scores=True,
        # pad_token is guaranteed to be set above; `or` would wrongly skip a valid ID of 0.
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

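# output.scores has one entry per generated step, so this offset is correct
# whether generate() returns prompt + answer or only the new tokens.
prompt_len = output.sequences.shape[1] - len(output.scores)
answer_ids = output.sequences[:, prompt_len:].clone()
# Negative IDs (such as the image placeholder) are not in the vocabulary;
# map them to the pad token before decoding.
answer_ids[answer_ids < 0] = tokenizer.pad_token_id
answer = tokenizer.decode(answer_ids[0], skip_special_tokens=True).strip()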
print(answer)
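To get a feel for how tight the budget in the Quick Start is, the following sketch estimates the number of candidate visual tokens per scale and compares it with the `multiscale_topk` values. It rests on several assumptions not confirmed by this model card: each scale is treated as a square `size × size` image cut into non-overlapping 336-pixel tiles, each tile yields (336 / 14)² = 576 patch tokens from the ViT-L/14 tower (CLS token excluded), and each `multiscale_topk` entry is read as the number of tokens kept at that scale. The helper `budget_table` is hypothetical.

```python
import math

def budget_table(target_sizes, topk, tile_size=336, patch_size=14):
    """Hypothetical helper: estimate candidate vs. kept tokens per scale,
    assuming square inputs and non-overlapping square tiles."""
    tokens_per_tile = (tile_size // patch_size) ** 2  # 24 * 24 = 576
    rows = []
    for size, k in zip(target_sizes, topk):
        tiles = math.ceil(size / tile_size) ** 2
        candidates = tiles * tokens_per_tile
        rows.append({
            "size": size,
            "tiles": tiles,
            "candidates": candidates,
            "kept": min(k, candidates),  # cannot keep more than exist
        })
    return rows

rows = budget_table([672, 1344, 2688, 4032], [80, 320, 600, 2000])
total_candidates = sum(r["candidates"] for r in rows)
total_kept = sum(r["kept"] for r in rows)

for r in rows:
    print(r)
print(f"kept {total_kept} of {total_candidates} "
      f"({100 * total_kept / total_candidates:.2f}%)")
```

Under these assumptions the four scales produce 4, 16, 64, and 144 tiles, and the selected tokens amount to roughly 2% of the candidates. The actual counts depend on how `split_image_to_multiscale_tiles` handles aspect ratio and padding.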

Citation

@inproceedings{dang2026uhrbat,
  title={UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing},
  author={Dang, Yunkai and Dai, Minxin and Yang, Yuekun and Li, Zhangnan and Li, Wenbin and Miao, Feng and Gao, Yang},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}