Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation

📖 Introduction

We present Hallo-Live, a real-time, text-driven framework for joint audio-video avatar generation. The method uses a causal dual-stream DiT model to generate synchronized avatar video and speech in a streaming manner. Hallo-Live reaches 20.38 FPS with 0.94 s latency on two NVIDIA H200 GPUs while preserving strong lip-sync accuracy, visual fidelity, and speech quality.

🏗️ Framework

Overview of the Hallo-Live framework. Top left: Stage I training adapts a pretrained dual-stream DiT to the streaming setting using a cross-modal future-expanding block-causal mask. Bottom left: Stage II training performs autoregressive self-rollout with the audio-video KV cache and optimizes the generated trajectory with reward-weighted dual-stream DMD. Right: Each causal fusion block in the dual-stream DiT consists of cross-modal attention between the video and audio streams; the block-causal masks are used in Stage I ODE initialization, and a KV cache is maintained for Stage II self-rollout and streaming inference.
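To make the block-causal masking idea concrete, here is a minimal sketch of a plain block-causal attention mask for self-attention and for video-to-audio cross-attention. It is an illustration under stated assumptions, not the Hallo-Live implementation: the helper names (`block_ids`, `block_causal_mask`), the block counts, and the tokens-per-block values are hypothetical, and the "future-expanding" variant mentioned above is not reproduced because its details are not given here.

```python
import numpy as np

def block_ids(num_blocks, tokens_per_block):
    """Assign each token the index of the temporal block it belongs to."""
    return np.repeat(np.arange(num_blocks), tokens_per_block)

def block_causal_mask(q_block_ids, k_block_ids):
    """Boolean mask, True where a query token may attend to a key token.

    A query in block b sees every key in blocks 0..b (full attention
    inside its own block, nothing from later blocks).
    """
    q = np.asarray(q_block_ids)[:, None]
    k = np.asarray(k_block_ids)[None, :]
    return k <= q

# Hypothetical sizes: 3 temporal blocks, with 2 video tokens and
# 4 audio tokens per block (the real model's sizes are not stated here).
video_ids = block_ids(num_blocks=3, tokens_per_block=2)
audio_ids = block_ids(num_blocks=3, tokens_per_block=4)

self_mask = block_causal_mask(video_ids, video_ids)    # video -> video
cross_mask = block_causal_mask(video_ids, audio_ids)   # video -> audio
```

Because the two streams can carry different numbers of tokens per block, the mask is built from block indices rather than token positions, so the same function covers both the within-stream and the cross-modal case.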

Model tree for fudan-generative-ai/Hallo-Live

Base model: Wan-AI/Wan2.2-TI2V-5B
Finetuned: chetwinlow1/Ovi
Finetuned (2): this model

Paper for fudan-generative-ai/Hallo-Live

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

Paper • 2604.23632