Valley Family: Exploring Scalable Vision-Language Design for Multimodal Understanding and Reasoning

🎮️ Github Repository | 🤗 Huggingface Model Collection | 📙 Technical Report

News

[2026/05/05] 🔥🔥🔥 We have released the technical report of Valley3! Check out the full paper here: Valley3 Technical Report.
[2026/05/01] 🔥🔥🔥 We have released the model weights of Valley3, an omni foundation model collection for unified e-commerce understanding and reasoning.
[2025/11/27] 🔥🔥 We have released the technical report of Valley2.5! Check out the full paper here: Valley2.5 Technical Report.
[2025/10/26] 🔥🔥 We have released the weights of Valley2.5, which significantly enhances multimodal understanding and reasoning capabilities. It has achieved 74.3 on the OpenCompass Multi-modal Academic Leaderboard!

Introduction

Valley is a cutting-edge multimodal large model developed by ByteDance, designed to handle a variety of tasks involving text, image, and video data. Our model:

Achieved the best results on in-house e-commerce and short-video benchmarks, well ahead of other SOTA open-source models.
Demonstrated strong performance on the OpenCompass benchmark.

Valley3

Valley3 is built upon the Qwen3-VL backbone and extends it with an audio transformer for audio encoding. The audio embeddings are aligned to the vision-language backbone via an MLP-based connector, then concatenated with visual and text tokens into a unified input space, enabling omni-modal understanding.
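
The data flow described above can be illustrated with a minimal, dependency-free sketch. It is not the Valley3 implementation: the function names (`mlp_connector`, `build_unified_sequence`), the two-layer GELU design, the toy dimensions, and the visual/audio/text ordering are all assumptions made for illustration. It only shows how audio features of one width could be projected to the backbone width and joined with the other tokens.

```python
import math
import random


def init_linear(rng, d_in, d_out):
    """Create a small random weight matrix (d_in x d_out) and a zero bias."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
    bias = [0.0] * d_out
    return weights, bias


def linear(rows, weights, bias):
    """Apply y = x @ W + b to every vector in `rows`."""
    return [
        [sum(x * weights[i][j] for i, x in enumerate(row)) + bias[j]
         for j in range(len(bias))]
        for row in rows
    ]


def gelu(v):
    """Tanh approximation of GELU (the activation choice is an assumption)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def mlp_connector(audio_feats, params):
    """Hypothetical two-layer MLP mapping audio features to the backbone width."""
    (w1, b1), (w2, b2) = params
    hidden = [[gelu(v) for v in row] for row in linear(audio_feats, w1, b1)]
    return linear(hidden, w2, b2)


def build_unified_sequence(visual_tokens, audio_tokens, text_tokens):
    """Concatenate token embeddings along the sequence axis.

    The visual -> audio -> text order is an assumption; a real model
    would place tokens according to its prompt template.
    """
    return visual_tokens + audio_tokens + text_tokens


rng = random.Random(0)
AUDIO_DIM, LLM_DIM = 6, 8  # toy sizes, not the real model dimensions

params = (init_linear(rng, AUDIO_DIM, LLM_DIM), init_linear(rng, LLM_DIM, LLM_DIM))

# Stand-ins for encoder outputs: 5 audio frames, 4 visual tokens, 3 text tokens.
audio_feats = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(5)]
visual_tokens = [[rng.random() for _ in range(LLM_DIM)] for _ in range(4)]
text_tokens = [[rng.random() for _ in range(LLM_DIM)] for _ in range(3)]

audio_tokens = mlp_connector(audio_feats, params)
sequence = build_unified_sequence(visual_tokens, audio_tokens, text_tokens)

print(len(sequence), len(sequence[0]))  # 12 8
```

The key point of the sketch is that, after the connector, every modality shares the same embedding width, so the language model can attend over a single sequence.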