Papers
arxiv:2603.29252

Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Published on Mar 31
Authors:
,
,
,
,
,
,
,

Abstract

FlexMem enables long video understanding for multimodal large language models by mimicking human visual memory mechanisms through flexible memory transfer and reading strategies.

AI-generated summary

Long video understanding is a key challenge that plagues the advancement of Multimodal Large language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mechanism, and proposed a novel and training-free approach, termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite lengths, unlike previous methods that process all video information at once and have input upper-limit. Concretely, FlexMem first consider the visual KV caches as the memory sources, and realize the effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different memory reading strategies for the diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs, and conduct extensive experiments on five long video and one streaming video task. The experimental results show that on a single 3090 GPU, our FlexMem can achieve obvious improvements than existing efficient video understanding methods and process more than 1k frames, which also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs on some benchmarks, e.g. , GPT-4o and Gemini-1.5 Pro.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2603.29252
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.29252 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.29252 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.29252 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.