Reshoot-Anything


Reshoot-Anything is a self-supervised video reshooting model built on top of Wan2.2-I2V-A14B. Given a source video and a target camera trajectory (encoded as an anchor video), it generates a high-fidelity reshoot that faithfully follows the new camera path while preserving the original content, complex dynamics, and temporal consistency, trained entirely on in-the-wild monocular videos.

Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar
Morphic Inc. · arXiv:2604.21776


[Demo: source video vs. reshot video]

Model Files

This repository contains two LoRA checkpoints (rank-512, applied to attention and feed-forward layers of Wan2.2-I2V-A14B):

| File | Role | Notes |
| --- | --- | --- |
| jan06_scaling_80k_ckpt1400.safetensors | High-noise expert | Controls early denoising steps. Primarily responsible for camera-motion alignment and global scene structure. Trained on ~80k clips with scaling augmentations + 15% synthetic data mixture. |
| dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors | Low-noise expert | Controls late denoising steps. Responsible for texture fidelity and fine detail. Uses standard black-background anchors, no source reconstruction loss. |

Each file is ~9.82 GB.
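
The checkpoints can be inspected with safetensors before loading; a minimal sketch (the lora_A/lora_B key naming mentioned in the comment is an assumption, not a documented layout):

from safetensors.torch import load_file

# Load one LoRA checkpoint and summarize it. LoRA weights typically come in
# lora_A / lora_B (or lora_down / lora_up) pairs per adapted layer.
state_dict = load_file("./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors")

total = sum(t.numel() for t in state_dict.values())
print(f"{len(state_dict)} tensors, {total / 1e9:.2f}B LoRA parameters")

for name, tensor in list(state_dict.items())[:8]:
    print(name, tuple(tensor.shape), tensor.dtype)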


Quickstart

1. Clone the repository

git clone https://github.com/morphicfilms/video-to-video.git
cd video-to-video

Follow the Wan2.2 installation guide to set up the environment, or run:

bash setup_env.sh

2. Download the weights

Download the Wan2.2 I2V base weights:

huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B

Download the Reshoot-Anything LoRA weights:

huggingface-cli download morphic/reshoot-anything --local-dir ./reshoot-anything-weights
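
Both downloads can also be scripted with the huggingface_hub Python API, equivalent to the CLI calls above:

from huggingface_hub import snapshot_download

# Base Wan2.2 I2V weights and the Reshoot-Anything LoRA weights; the
# local_dir paths match the CLI commands above.
snapshot_download("Wan-AI/Wan2.2-I2V-A14B", local_dir="./Wan2.2-I2V-A14B")
snapshot_download("morphic/reshoot-anything", local_dir="./reshoot-anything-weights")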

3. Prepare your anchor video

At inference, generate an anchor video by converting your source video to a 4D point cloud, applying the target camera trajectory, and forward-warping to produce the geometric anchor. See the repo's anchor_generation/ scripts for details.
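
For intuition only, a rough single-frame illustration of the forward-warping step (this is not the repo's anchor_generation code; the per-pixel depth map, intrinsics K, and relative camera transform are assumed inputs, and the real pipeline warps from a 4D point cloud built over the whole video):

import numpy as np

def forward_warp_frame(frame, depth, K, T_src_to_tgt):
    """Splat source pixels into the target view given per-pixel depth,
    3x3 intrinsics K, and a 4x4 source-to-target camera transform.
    Unfilled (disoccluded) regions stay black in this sketch."""
    h, w, _ = frame.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T.astype(np.float64)

    # Unproject to 3D in the source camera, then move to the target camera.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts = T_src_to_tgt @ np.vstack([pts, np.ones((1, pts.shape[1]))])

    # Project into the target image plane.
    proj = K @ pts[:3]
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)

    # Z-buffered splat: the nearest surface wins at each target pixel.
    anchor = np.zeros_like(frame)
    zbuf = np.full((h, w), np.inf)
    colors = frame.reshape(-1, 3)
    valid = (proj[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for i in np.flatnonzero(valid):
        if pts[2, i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = pts[2, i]
            anchor[v[i], u[i]] = colors[i]
    return anchor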

4. Run reshooting

torchrun --nproc_per_node=8 generate.py \
    --task v2v-A14B \
    --size 1280*720 \
    --frame_num 81 \
    --ckpt_dir ./Wan2.2-I2V-A14B \
    --high_noise_lora_path ./reshoot-anything-weights/jan06_scaling_80k_ckpt1400.safetensors \
    --low_noise_lora_path ./reshoot-anything-weights/dec23_v2v_lownoise_black_lora_512_ckpt1000.safetensors \
    --source_video examples/source.mp4 \
    --anchor_video examples/anchor.mp4 \
    --dit_fsdp \
    --t5_fsdp \
    --ulysses_size 8

Note: Refer to the GitHub README for the authoritative argument names and single-GPU usage.


How It Works

Reshoot-Anything adapts the Wan2.2-14B Mixture-of-Experts (MoE) DiT with two key architectural changes:

Dual-stream token conditioning: both the anchor video V_a (geometric guide) and the source video V_s (texture reference) are VAE-encoded and temporally concatenated into the token sequence of the model's main self-attention. This outperforms cross-attention for view synchronization because it lets the model directly route textures across spatial and temporal positions.

Offset RoPE: a fixed temporal offset of 50 is added to the source-video token positional embeddings, strictly decoupling the source context from the active denoising trajectory.
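
A toy sketch of how these two choices might fit together (latent shapes, the shared timeline for anchor tokens, and the helper name are assumptions; the real Wan2.2 tokenization and RoPE details differ):

import torch

def build_conditioned_tokens(z_noisy, z_anchor, z_source, source_offset=50):
    """z_* are VAE latents shaped (B, T, N, C): T latent frames of N spatial
    tokens each. Returns the flattened token sequence for self-attention and
    the temporal position ids used by RoPE."""
    B, T, N, C = z_noisy.shape

    # Dual-stream token conditioning: anchor and source latents join the
    # denoising stream along the temporal axis, so one self-attention pass
    # can route textures across all spatial and temporal positions.
    tokens = torch.cat([z_noisy, z_anchor, z_source], dim=1)   # (B, 3T, N, C)

    # Offset RoPE: source tokens get a fixed +50 temporal offset, keeping the
    # reference context separated from the active denoising trajectory.
    t_noisy  = torch.arange(T)
    t_anchor = torch.arange(T)                 # assumed to share the target timeline
    t_source = torch.arange(T) + source_offset
    temporal_pos = torch.cat([t_noisy, t_anchor, t_source]).repeat_interleave(N)

    return tokens.reshape(B, 3 * T * N, C), temporal_pos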

The model learns implicit 4D spatiotemporal routing: when a target frame requires content that is occluded in the corresponding source frame, the model locates and re-projects the missing texture from a different timestep in the source video.

Self-Supervised Training Pipeline

Training requires no paired multi-view data. From a single monocular video:

  1. Two independent smooth random-walk crop trajectories are sampled → source V_s and target V_t (a sketch of this sampling follows the list)
  2. V_s[0] is forward-warped via AllTracker dense flow + crop offset → anchor V_a
  3. The triplet (V_s, V_a, V_t) forms the training signal
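
A minimal sketch of the trajectory sampling in step 1 (step scale, smoothing, and the crop sizes below are assumptions; the AllTracker-based warping of step 2 is not shown):

import numpy as np

def sample_crop_trajectory(num_frames, frame_hw, crop_hw, step_sigma=4.0, smooth=9):
    """Sample one smooth random-walk crop trajectory over a monocular video.
    Returns per-frame (top, left) crop corners; two independent calls give the
    source and target trajectories used to build (V_s, V_a, V_t)."""
    H, W = frame_hw
    ch, cw = crop_hw
    # Gaussian random walk of the crop corner, smoothed with a small box filter.
    steps = np.random.randn(num_frames, 2) * step_sigma
    path = np.cumsum(steps, axis=0)
    kernel = np.ones(smooth) / smooth
    path = np.stack([np.convolve(path[:, i], kernel, mode="same") for i in range(2)], axis=1)
    # Start near the frame center and clamp so the crop stays inside the frame.
    start = np.array([(H - ch) / 2.0, (W - cw) / 2.0])
    path = np.clip(path + start, 0, [H - ch, W - cw])
    return path.astype(int)

# Two independent trajectories over the same clip -> source and target crops.
src_traj = sample_crop_trajectory(81, (1080, 1920), (720, 1280))
tgt_traj = sample_crop_trajectory(81, (1080, 1920), (720, 1280))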

A hybrid dataset strategy augments the monocular pipeline with a 15% mixture of paired synthetic data from ReCamMaster, enabling generalization to extreme (120°+) orbital camera trajectories.
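
One simple way to realize the 15% mixture is a per-sample choice between the two data sources; a hypothetical sketch (build_self_supervised_triplet stands in for steps 1-3 above):

import random

def sample_training_example(monocular_clips, recammaster_pairs, synthetic_ratio=0.15):
    """Hypothetical data-mixing sketch: 15% paired ReCamMaster clips,
    85% self-supervised triplets built from in-the-wild monocular videos."""
    if random.random() < synthetic_ratio:
        return random.choice(recammaster_pairs)      # already a paired (V_s, V_a, V_t)
    clip = random.choice(monocular_clips)
    return build_self_supervised_triplet(clip)       # hypothetical helper for steps 1-3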


Training Details

| Parameter | Value |
| --- | --- |
| Base model | Wan2.2-I2V-A14B (14B MoE) |
| LoRA rank | 512 (attention + FFN) |
| Training steps | 2,000 per expert |
| Batch size | 24 |
| Learning rate | 1e-5 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999) |
| Loss | MSE + 0.1 × L1 source reconstruction |
| Latent frames | 20 |
| Primary data | ~100k clips from 30k monocular videos |
| Synthetic mixture | 15% ReCamMaster paired clips |

Key augmentations: 3D-aware noise injection into the anchor reference frame (magnitude sampled uniformly from [0, 0.5]), fluorescent-pink masked-region backgrounds, random anchor reference-frame selection, and an auxiliary reconstruction loss on the source tokens.
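
A minimal sketch of the objective and the anchor noise-injection augmentation described above (what the denoising target is, and how the auxiliary source reconstruction is read out, depend on Wan2.2 internals and are assumptions here):

import torch
import torch.nn.functional as F

def reshoot_loss(pred, target, src_pred, src_latent, aux_weight=0.1):
    """MSE on the denoising target plus a 0.1-weighted L1 reconstruction
    loss on the source-video tokens."""
    return F.mse_loss(pred, target) + aux_weight * F.l1_loss(src_pred, src_latent)

def inject_anchor_noise(anchor_ref_latent):
    """Noise injection on the anchor reference frame with magnitude sampled
    uniformly from [0, 0.5]; the '3D-aware' structuring of that noise is
    not reproduced in this sketch."""
    mag = torch.rand(()) * 0.5
    return anchor_ref_latent + mag * torch.randn_like(anchor_ref_latent)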


Evaluation

Results on 100 five-second videos from OpenSora-Mixkit (16fps, 480p):

| Method | CLIP-F ↑ | RotErr ↓ | TransErr ↓ | Mat. Pix ↑ | FVD-V ↓ | CLIP-V ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| ReCamMaster | 98.49 | 11.29 | 19.59 | 1314.00 | 732.52 | 88.91 |
| EX-4D | 98.94 | 3.94 | 4.21 | 2188.98 | 685.63 | 89.77 |
| TrajectoryCrafter (49f) | 98.80 | 2.26 | 3.03 | 1851.80 | 582.56 | 92.40 |
| Ours | 99.03 | 2.76 | 4.23 | 2720.83 | 586.24 | 93.16 |
| Ours (49f) | 99.01 | 2.61 | 2.73 | 2737.65 | 488.22 | 94.96 |

Citation

@article{paliwal2026reshootanything,
  title={Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting},
  author={Paliwal, Avinash and Iyer, Adithya and Yadav, Shivin and Afridi, Muhammad Ali and Harikumar, Midhun},
  journal={arXiv preprint arXiv:2604.21776},
  year={2026}
}

License

Model weights are released under the Apache 2.0 license, consistent with the Wan2.2 base model.
