arxiv:2606.09516

SwiftVR: Real-Time One-Step Generative Video Restoration

Published on Jun 8 · Submitted by Xiangyu Chen on Jun 9

Abstract

SwiftVR enables real-time video restoration on consumer GPUs through efficient attention mechanisms and lightweight autoencoding, achieving high frame rates at 4K resolution.

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. Project is available at https://h-oliday.github.io/SwiftVR.
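The abstract does not spell out how windows are gathered, so the following is only a minimal sketch of one way to read "mask-free shifted-window self-attention". It assumes that a shifted window grid simply produces shorter windows at the frame edges, and that each window's tokens are gathered by index and attended densely. All function names (`window_bounds`, `window_indices`, `dense_attention`, `windowed_attention`) are hypothetical and are not taken from the SwiftVR code.

```python
import math

def window_bounds(size, window, shift):
    """Split [0, size) into segments of a window grid offset by `shift`.

    Assumption: edge segments are simply shorter, so no padding and no
    wrap-around (cyclic shift) is needed.
    """
    first_cut = shift if shift > 0 else window
    cuts = [0] + list(range(first_cut, size, window)) + [size]
    return [(a, b) for a, b in zip(cuts[:-1], cuts[1:]) if b > a]

def window_indices(height, width, window, shift):
    """For each window, list the flat (row-major) token indices it covers."""
    rows = window_bounds(height, window, shift)
    cols = window_bounds(width, window, shift)
    return [[r * width + c for r in range(r0, r1) for c in range(c0, c1)]
            for (r0, r1) in rows for (c0, c1) in cols]

def dense_attention(q, k, v):
    """Plain scaled dot-product attention over one window, with no mask."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out

def windowed_attention(tokens, height, width, window, shift):
    """Gather each window by index, attend densely, scatter results back."""
    out = [None] * len(tokens)
    for idx in window_indices(height, width, window, shift):
        block = [tokens[i] for i in idx]  # dense block for this window
        for i, y in zip(idx, dense_attention(block, block, block)):
            out[i] = y
    return out

# Toy 4x4 frame with 2-channel tokens, window size 2, grid shifted by 1.
tokens = [[float(i), 1.0] for i in range(16)]
shifted = windowed_attention(tokens, height=4, width=4, window=2, shift=1)
```

In this toy version each window is processed on its own. A tensor implementation would more likely batch windows of equal shape and pass each batch to a standard dense SDPA call, but that batching strategy is also an assumption here.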

Community

Paper submitter

SwiftVR is a one-step generative video restoration framework designed for real-time streaming. While recent one-step diffusion methods reduce the number of denoising steps, they still struggle with high-resolution deployment due to expensive spatial attention and heavy video autoencoders. The main contribution is a deployment-oriented design: SwiftVR uses mask-free shifted-window self-attention to keep attention on the standard dense SDPA path, avoiding attention masks, cyclic shifts, padding, custom sparse kernels, or hardware-specific retraining. It also introduces a lightweight restoration-aware autoencoder and a causal chunk-wise streaming protocol. The reported runtime is impressive: 54 FPS at 1080p, 31 FPS at 1440p, and 14 FPS at 4K on a single H100. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1080p, making real-time generative video restoration on consumer hardware much more practical. The method also achieves strong no-reference perceptual quality, especially on MUSIQ, CLIP-IQA, and DISTS, while producing sharper and more natural details in real-world videos. A nice aspect of this work is that the speedup comes from architecture and implementation choices that remain compatible with standard dense attention backends, rather than relying on custom sparse kernels. This makes SwiftVR not only fast, but also easier to deploy.
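To put the quoted frame rates in terms of a per-frame latency budget, here is a small arithmetic sketch. It uses only the resolution and FPS figures reported on this page; the helper names `frame_time_ms` and `megapixels_per_second` are made up for illustration.

```python
# Figures quoted for SwiftVR on this page: (GPU, width, height, FPS).
REPORTED = [
    ("H100", 1920, 1080, 54),
    ("H100", 2560, 1440, 31),
    ("H100", 3840, 2160, 14),
    ("RTX 5090", 1920, 1080, 26),
]

def frame_time_ms(fps):
    """Average time available per output frame at a given frame rate."""
    return 1000.0 / fps

def megapixels_per_second(width, height, fps):
    """Output pixel throughput implied by a resolution and frame rate."""
    return width * height * fps / 1e6

for gpu, w, h, fps in REPORTED:
    print(f"{gpu:9s} {w}x{h}: {frame_time_ms(fps):5.1f} ms/frame, "
          f"{megapixels_per_second(w, h, fps):6.1f} MP/s")
```

The three H100 rows work out to roughly 112, 114, and 116 MP/s, so the implied pixel throughput stays nearly flat from 1080p to 4K. That is what one would expect if per-frame cost grows about linearly with pixel count, although this is arithmetic on the reported numbers rather than an independent measurement.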

Cool paper - I liked the way "SwiftVR: Real-Time One-Step Generative Video Restoration" frames the problem without making it feel too abstract.

Curious if you think this would still work once the setup gets messier in the wild?

I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/8c85c855-c00a-4c90-9cfc-e1e41f73716e


Thanks for the kind words, and for making the ResearchPod episode!

Yes, we do think SwiftVR can still work in messier real-world scenarios, including more complex scenes. That said, we should be honest that its enhancement quality is still noticeably weaker than some high-compute, non-real-time restoration methods. There are still technical bottlenecks that have not been fully solved, so we are not 100% satisfied with the model’s performance yet.

Our current view is that SwiftVR makes a fairly good trade-off between restoration quality, inference speed, and computational cost. The main goal is not to outperform every offline enhancement method, but to make generative video restoration practical in real-time settings.

