arxiv:2606.09516

SwiftVR: Real-Time One-Step Generative Video Restoration

Published on Jun 8

Abstract

SwiftVR enables real-time video restoration on consumer GPUs through efficient attention mechanisms and lightweight autoencoding, achieving high frame rates at 4K resolution.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. Project is available at https://h-oliday.github.io/SwiftVR.
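The idea of gathering windows into a dense tensor so that attention needs no mask can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not spell out the indexing scheme, so the sketch assumes that shifted windows are kept full-size by clamping their start offsets to the image bounds, and that overlapping outputs are averaged when written back. The names `window_starts`, `dense_attention`, and `window_attention` are hypothetical, and queries, keys, and values are all set to the input tokens for brevity.

```python
import numpy as np

def window_starts(size, win, shift):
    """Start offsets of full-size windows along one axis.

    ASSUMPTION (not from the paper): shifted windows that would cross the
    border are clamped inward, so every window holds exactly `win` elements
    and neither padding nor a mask is needed. Requires size >= win and
    0 <= shift < win.
    """
    last = size - win
    return sorted({min(max(s, 0), last) for s in range(shift - win, size, win)})

def dense_attention(tokens):
    """Plain softmax attention over (num_windows, tokens, channels), no mask."""
    scale = 1.0 / np.sqrt(tokens.shape[-1])
    scores = tokens @ tokens.transpose(0, 2, 1) * scale
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def window_attention(x, win, shift):
    """Gather windows by index, attend densely, scatter back (averaging overlaps)."""
    h, w, c = x.shape
    rows = np.array(window_starts(h, win, shift))[:, None] + np.arange(win)  # (nh, win)
    cols = np.array(window_starts(w, win, shift))[:, None] + np.arange(win)  # (nw, win)
    r_idx, c_idx = np.broadcast_arrays(rows[:, None, :, None], cols[None, :, None, :])
    windows = x[r_idx, c_idx]                      # (nh, nw, win, win, c), dense
    nh, nw = windows.shape[:2]
    out = dense_attention(windows.reshape(nh * nw, win * win, c))
    out = out.reshape(nh, nw, win, win, c)
    acc = np.zeros((h, w, c), dtype=np.float64)
    cnt = np.zeros((h, w), dtype=np.float64)
    np.add.at(acc, (r_idx, c_idx), out)            # sum contributions per pixel
    np.add.at(cnt, (r_idx, c_idx), 1.0)            # how many windows cover each pixel
    return acc / cnt[..., None]

x = np.random.default_rng(0).standard_normal((8, 8, 4))
y_regular = window_attention(x, win=4, shift=0)    # window starts 0, 4
y_shifted = window_attention(x, win=4, shift=2)    # window starts 0, 2, 4
```

Every window has the same shape, so the whole batch goes through one unmasked attention call. In a PyTorch model, the `dense_attention` step would presumably map to `torch.nn.functional.scaled_dot_product_attention` called without an `attn_mask`, which is what keeps it on the standard dense path.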


Community

chxy95

Paper submitter about 10 hours ago

SwiftVR is a one-step generative video restoration framework designed for real-time streaming. While recent one-step diffusion methods reduce the number of denoising steps, they still struggle with high-resolution deployment due to expensive spatial attention and heavy video autoencoders. The main contribution is a deployment-oriented design: SwiftVR uses mask-free shifted-window self-attention to keep attention on the standard dense SDPA path, avoiding attention masks, cyclic shifts, padding, custom sparse kernels, or hardware-specific retraining. It also introduces a lightweight restoration-aware autoencoder and a causal chunk-wise streaming protocol. The reported runtime is impressive: 54 FPS at 1080p, 31 FPS at 1440p, and 14 FPS at 4K on a single H100. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1080p, making real-time generative video restoration on consumer hardware much more practical. The method also achieves strong no-reference perceptual quality, especially on MUSIQ, CLIP-IQA, and DISTS, while producing sharper and more natural details in real-world videos. A nice aspect of this work is that the speedup comes from architecture and implementation choices that remain compatible with standard dense attention backends, rather than relying on custom sparse kernels. This makes SwiftVR not only fast, but also easier to deploy.
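To put the reported frame rates in perspective, the short calculation below converts each one into a per-frame time budget and a pixel throughput. It uses only the figures quoted above. The helper names are hypothetical, and because the reported FPS values are rounded, the results are rough estimates.

```python
# (device, width, height, reported FPS), as quoted in the abstract and comment
REPORTED = [
    ("H100", 1920, 1080, 54),
    ("H100", 2560, 1440, 31),
    ("H100", 3840, 2160, 14),
    ("RTX 5090", 1920, 1080, 26),
]

def frame_budget_ms(fps):
    """Average time available per frame at a sustained frame rate."""
    return 1000.0 / fps

def megapixels_per_second(width, height, fps):
    """Output pixels produced per second, in millions."""
    return width * height * fps / 1e6

for device, width, height, fps in REPORTED:
    print(
        f"{device:9s} {width}x{height}: "
        f"{frame_budget_ms(fps):5.1f} ms/frame, "
        f"{megapixels_per_second(width, height, fps):6.1f} MP/s"
    )
```

On the H100, the three resolutions work out to roughly 112, 114, and 116 megapixels per second. Throughput per pixel therefore stays nearly constant from 1080p to 4K, which is what one would expect if cost grows about linearly with pixel count rather than quadratically.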

noahml

about 6 hours ago

Cool paper - I liked the way "SwiftVR: Real-Time One-Step Generative Video Restoration" frames the problem without making it feel too abstract.

Curious if you think this would still work once the setup gets messier in the wild?

I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/8c85c855-c00a-4c90-9cfc-e1e41f73716e

chxy95

about 5 hours ago

Thanks for the kind words, and for making the ResearchPod episode!

Yes, we do think SwiftVR can still work in messier real-world scenarios, including more complex scenes. That said, we should be honest that its enhancement quality is still noticeably weaker than some high-compute, non-real-time restoration methods. There are still technical bottlenecks that have not been fully solved, so we are not 100% satisfied with the model’s performance yet.

Our current view is that SwiftVR makes a fairly good trade-off between restoration quality, inference speed, and computational cost. The main goal is not to outperform every offline enhancement method, but to make generative video restoration practical in real-time settings.