Reshoot-Anything
Reshoot-Anything is a self-supervised video reshooting model built on top of Wan2.2-I2V-A14B. Given a source video and a target camera trajectory (encoded as an anchor video), it generates a high-fidelity reshoot that faithfully follows the new camera path while preserving original content, complex dynamics, and temporal consistency, trained entirely on in-the-wild monocular videos.
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar
Morphic Inc. · arXiv:2604.21776
*(Side-by-side example: source video vs. reshot video.)*
Model Files
This repository contains two LoRA checkpoints (rank-512, applied to attention and feed-forward layers of Wan2.2-I2V-A14B):
| File | Role | Notes |
|---|---|---|
| `jan06_scaling_80k_ckpt1400.safetensors` | High-noise expert | Controls early denoising steps. Primarily responsible for camera motion alignment and global scene structure. Trained on ~80k clips with scaling augmentations + 15% synthetic data mixture. |
| `dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors` | Low-noise expert | Controls late denoising steps. Responsible for texture fidelity and fine detail. Uses standard black-background anchors, no source reconstruction loss. |
Each checkpoint file is ~9.82 GB.
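The two experts are applied at different stages of the denoising schedule. As an illustrative sketch of that routing (the boundary value and helper below are assumptions for exposition, not the repository's actual API; the real switching happens inside the Wan2.2 MoE pipeline):

```python
# Illustrative sketch: route each denoising step to the matching LoRA expert.
# The BOUNDARY value and select_lora helper are hypothetical; they only
# mirror the "early steps -> high-noise expert, late steps -> low-noise
# expert" split described in the table above.

HIGH_NOISE_LORA = "reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors"
LOW_NOISE_LORA = "reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors"

BOUNDARY = 0.9  # hypothetical normalized-timestep threshold separating the two experts

def select_lora(t_normalized: float) -> str:
    """High-noise (early) steps get the structure/camera expert, late steps the detail expert."""
    return HIGH_NOISE_LORA if t_normalized >= BOUNDARY else LOW_NOISE_LORA
```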
Quickstart
1. Clone the repository
```bash
git clone https://github.com/morphicfilms/video-to-video.git
cd video-to-video
```
Follow the Wan2.2 installation guide to set up the environment, or run:
```bash
bash setup_env.sh
```
2. Download the weights
Download the Wan2.2 I2V base weights:
```bash
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
```
Download the Reshoot-Anything LoRA weights:
```bash
huggingface-cli download morphic/reshoot-anything --local-dir ./reshoot-anything-weights
```
3. Prepare your anchor video
At inference, generate an anchor video by converting your source video to a 4D point cloud, applying the target camera trajectory, and forward-warping to produce the geometric anchor. See the repo's anchor_generation/ scripts for details.
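For intuition, here is a heavily simplified single-frame sketch of the forward-warping step. It assumes pinhole intrinsics `K`, a per-pixel depth map, and a 4×4 target world-to-camera pose; the actual `anchor_generation/` scripts operate on full 4D point clouds and handle occlusion and hole filling properly.

```python
# Minimal sketch of forward-warping one frame into a target camera view.
# All names and shapes here are illustrative, not the repository's API.
import numpy as np

def forward_warp(frame, depth, K, target_pose):
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Lift pixels to 3D points in the source camera frame (source pose = identity here).
    pts_cam = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    pts_world = np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])

    # Re-project the points through the target camera pose.
    pts_tgt = (target_pose @ pts_world)[:3]
    proj = K @ pts_tgt
    uv = (proj[:2] / np.clip(proj[2:], 1e-6, None)).T.round().astype(int)

    # Scatter (splat) source colors into the target image; later writes win.
    warped = np.zeros_like(frame)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h) & (pts_tgt[2] > 0)
    src = frame.reshape(-1, frame.shape[-1])
    warped[uv[valid, 1], uv[valid, 0]] = src[valid]
    return warped
```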
4. Run reshooting
```bash
torchrun --nproc_per_node=8 generate.py \
    --task v2v-A14B \
    --size 1280*720 \
    --frame_num 81 \
    --ckpt_dir ./Wan2.2-I2V-A14B \
    --high_noise_lora_path ./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors \
    --low_noise_lora_path ./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors \
    --source_video examples/source.mp4 \
    --anchor_video examples/anchor.mp4 \
    --dit_fsdp \
    --t5_fsdp \
    --ulysses_size 8
```
Note: Refer to the GitHub README for the authoritative argument names and single-GPU usage.
How It Works
Reshoot-Anything adapts the Wan2.2-14B Mixture-of-Experts (MoE) DiT with two key architectural changes:
1. **Dual-stream token conditioning**: Both the anchor video `V_a` (geometric guide) and the source video `V_s` (texture reference) are VAE-encoded and temporally concatenated as tokens into the model's main self-attention stream. This outperforms cross-attention for view synchronization by letting the model directly route textures across spatial and temporal positions.
2. **Offset RoPE**: A fixed temporal offset of 50 is added to the source video token positional embeddings, strictly decoupling the source context from the active denoising trajectory.
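As a rough sketch of what this conditioning interface looks like (the tensor layout, concatenation order, and the way anchor positions are assigned are illustrative assumptions, not the exact DiT wiring):

```python
# Illustrative sketch: temporally concatenate anchor and source latents into one
# token stream, and give the source tokens a fixed temporal RoPE offset so they
# never overlap the positions of the frames being denoised.
import torch

TEMPORAL_OFFSET = 50  # fixed offset applied to source-video token positions

def build_conditioned_tokens(noisy_latents, anchor_latents, source_latents):
    """All tensors assumed shaped [batch, frames, tokens_per_frame, dim]."""
    b, f_gen, n, d = noisy_latents.shape
    f_src = source_latents.shape[1]

    # Single self-attention stream: denoised frames + anchor guide + source reference.
    tokens = torch.cat([noisy_latents, anchor_latents, source_latents], dim=1)

    # Temporal position ids: the anchor is assumed frame-aligned with the output,
    # while the source reference is pushed out by the fixed offset.
    pos_gen = torch.arange(f_gen)
    pos_anchor = torch.arange(f_gen)
    pos_source = torch.arange(f_src) + TEMPORAL_OFFSET
    positions = torch.cat([pos_gen, pos_anchor, pos_source])
    return tokens, positions
```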
The model learns implicit 4D spatiotemporal routing: when a target frame requires content that is occluded in the corresponding source frame, the model locates and re-projects the missing texture from a different timestep in the source video.
Self-Supervised Training Pipeline
Training requires no paired multi-view data. From a single monocular video:
- Two independent smooth random-walk crop trajectories are sampled: a source crop `V_s` and a target crop `V_t` (a sketch of the trajectory sampling follows this list)
- `V_s[0]` is forward-warped via AllTracker dense flow + crop offset into the anchor `V_a`
- The triplet `(V_s, V_a, V_t)` forms the training signal
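A minimal sketch of one way to sample such a smooth random-walk crop trajectory (crop size, step scale, and smoothing kernel are illustrative choices, not the paper's exact parameters):

```python
# Illustrative sketch: sample a smooth random walk of crop windows over a video.
# Step scale and smoothing window are assumptions for illustration only.
import numpy as np

def sample_crop_trajectory(num_frames, frame_h, frame_w, crop_h, crop_w,
                           step=4.0, smooth=9, rng=None):
    rng = rng or np.random.default_rng()
    steps = rng.normal(scale=step, size=(num_frames, 2))   # random-walk increments
    path = np.cumsum(steps, axis=0)

    # Smooth the walk with a simple moving average so the camera path is not jittery.
    kernel = np.ones(smooth) / smooth
    path = np.stack([np.convolve(path[:, i], kernel, mode="same") for i in range(2)], axis=1)

    # Center the walk and clamp so every crop stays inside the frame.
    path -= path.mean(axis=0)
    center = np.array([(frame_h - crop_h) / 2, (frame_w - crop_w) / 2])
    top_left = np.clip(path + center, 0, [frame_h - crop_h, frame_w - crop_w])
    return top_left.round().astype(int)  # (num_frames, 2) crop origins (y, x)
```

Sampling two independent trajectories from the same clip yields the source and target crops `V_s` and `V_t`.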
A hybrid dataset strategy augments the monocular pipeline with a 15% mixture of paired synthetic data from ReCamMaster, enabling generalization to extreme (120°+) orbital camera trajectories.
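As an illustrative sketch of the implied sampling behavior (the dataset containers and mixing mechanism here are assumptions, not the training code):

```python
# Illustrative sketch: draw roughly 15% of training clips from the paired
# synthetic set and the rest from the monocular self-supervised pipeline.
import random

SYNTHETIC_RATIO = 0.15

def sample_training_clip(monocular_clips, synthetic_clips, rng=random):
    pool = synthetic_clips if rng.random() < SYNTHETIC_RATIO else monocular_clips
    return rng.choice(pool)
```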
Training Details
| Parameter | Value |
|---|---|
| Base model | Wan2.2-I2V-A14B (14B MoE) |
| LoRA rank | 512 (attention + FFN) |
| Training steps | 2,000 per expert |
| Batch size | 24 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Loss | MSE + 0.1 × L1 source reconstruction |
| Latent frames | 20 |
| Primary data | ~100k clips from 30k monocular videos |
| Synthetic mixture | 15% ReCamMaster paired clips |
Key augmentations: 3D-aware noise injection into anchor reference frame (magnitude uniform [0, 0.5]), fluorescent pink masked-region backgrounds, random anchor reference frame selection, source token auxiliary reconstruction loss.
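A minimal sketch of the combined objective, assuming the model produces predictions for both the target frames and the auxiliary source-reconstruction tokens (the tensor names and function signature are illustrative):

```python
# Illustrative sketch of the training objective: MSE on the target prediction
# plus a 0.1-weighted L1 auxiliary reconstruction loss on the source tokens.
import torch.nn.functional as F

SOURCE_RECON_WEIGHT = 0.1

def reshoot_loss(pred_target, gt_target, pred_source, gt_source):
    mse = F.mse_loss(pred_target, gt_target)    # main denoising loss
    src_l1 = F.l1_loss(pred_source, gt_source)  # auxiliary source reconstruction
    return mse + SOURCE_RECON_WEIGHT * src_l1
```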
Evaluation
Results on 100 five-second videos from OpenSora-Mixkit (16fps, 480p):
| Method | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat. Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
|---|---|---|---|---|---|---|
| ReCamMaster | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| EX-4D | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| TrajectoryCrafter (49f) | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| Ours | 99.03 | 2.76 | 4.23 | 2720.83 | 586.24 | 93.16 |
| Ours (49f) | 99.01 | 2.61 | 2.73 | 2737.65 | 488.22 | 94.96 |
Citation
```bibtex
@article{paliwal2026reshootanything,
  title={Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
  author={Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
  journal={arXiv preprint arXiv:2604.21776},
  year={2026}
}
```
License
Model weights are released under the Apache 2.0 license, consistent with the Wan2.2 base model.