arxiv:2606.15231

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Published on Jun 13

Submitted by Bo on Jun 17

Authors:

Abstract

Visual-Seeker enables visual-native multimodal deep search through active visual reasoning, outperforming proprietary models on real-world web environments.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by using external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent built on active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, and validate robust visual-native reasoning and search in real-world web environments. The code and data are available at https://github.com/ZhengboZhang/Visual-Seeker.
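
The abstract contrasts treating an image as a static input with an agent that keeps collecting visual evidence while it searches. As a rough illustration only, the toy loop below interleaves visual actions (cropping a region, searching by image) with a text search and records the modality of each piece of evidence. Every name here (`run_episode`, `TOOLS`, `scripted_policy`, the tool names) is a hypothetical stand-in and is not taken from the paper or its repository; in a real agent the policy would be the trained MLLM and the tools would call actual image-processing and web-search backends.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Evidence:
    modality: str  # "visual" or "text"
    content: str


@dataclass
class AgentState:
    question: str
    evidence: List[Evidence] = field(default_factory=list)


# A policy maps the current state to (action_name, argument).
Policy = Callable[[AgentState], Tuple[str, str]]
# Each tool is registered with the modality of the evidence it returns.
ToolTable = Dict[str, Tuple[str, Callable[[str], str]]]


def run_episode(
    question: str, policy: Policy, tools: ToolTable, max_steps: int = 8
) -> Tuple[Optional[str], List[Evidence]]:
    """Run one search episode; return (answer or None, collected evidence)."""
    state = AgentState(question)
    for _ in range(max_steps):
        action, arg = policy(state)
        if action == "answer":
            return arg, state.evidence
        modality, tool = tools[action]
        # Visual and textual results land in the same evidence trajectory.
        state.evidence.append(Evidence(modality, tool(arg)))
    return None, state.evidence  # step budget exhausted without an answer


# Stub tools (assumptions): they only echo their input as a tagged string.
TOOLS: ToolTable = {
    "crop": ("visual", lambda region: f"crop<{region}>"),
    "image_search": ("visual", lambda query: f"image-hit<{query}>"),
    "text_search": ("text", lambda query: f"text-hit<{query}>"),
}


def scripted_policy(state: AgentState) -> Tuple[str, str]:
    """Fixed plan standing in for a learned policy."""
    plan = [
        ("crop", "logo in top-left corner"),
        ("image_search", "cropped logo"),
        ("text_search", "founding year of matched brand"),
    ]
    step = len(state.evidence)
    if step < len(plan):
        return plan[step]
    return ("answer", state.evidence[-1].content)


answer, evidence = run_episode(
    "When was the brand in this photo founded?", scripted_policy, TOOLS
)
```

The point of the sketch is the trajectory shape: two visual steps precede the text lookup, so the final answer depends on evidence from both modalities rather than on text alone.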


Community

Zhengbo-Zhang

Paper submitter 1 day ago

🤗 Data: https://huggingface.co/datasets/Zhengbo-Zhang/Visual-Seeker-train-data
💻 Code: https://github.com/ZhengboZhang/Visual-Seeker
📄 Paper: https://arxiv.org/abs/2606.15231

noahml

about 2 hours ago

Neat paper. The shift from treating vision as a static input to actively hunting for evidence throughout the search process feels like a logical step for MLLMs. I like that they are focusing on visual-native reasoning rather than just relying on text-based trajectories.

Since the model is trained on these 5K synthesized multimodal trajectories, how much of the agent's performance comes from the quality of that specific data pipeline versus the underlying model architecture?

I made a podcast on it with ResearchPod, which makes it easy to pick up the key concepts on the go:
https://researchpod.app/episode/6a148022-df2d-4c10-aaf2-8aa0c62f9144


