PhyCo: Learning Controllable Physical Priors for Generative Motion
CVPR 2026 · Sriram Narayanan¹², Ziyu Jiang², Srinivasa G. Narasimhan¹, Manmohan Chandraker²³ (¹ Carnegie Mellon University · ² NEC Labs America · ³ UC San Diego)
Project Page · arXiv · Paper (PDF) · Simulation Code (PhyCo-Sim) · Dataset
TL;DR. PhyCo learns controllable physical priors — friction, restitution, deformation, and force — from simple block-sliding and ball-bouncing simulations, enabling physically grounded and continuously controllable video generation without any simulator at inference.
Abstract
Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes — without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.
Model
`phyco.pt` is a single self-contained checkpoint built on the frozen NVIDIA Cosmos-Predict2-2B-Video2World DiT. PhyCo's LoRA fine-tuning is merged into the weights, and all three physics-conditioning ControlNet branches are included:
| Branch | Physical property |
|---|---|
| `controlnet_1` | friction / restitution (bounciness) |
| `controlnet_2` | deformability |
| `controlnet_3` | applied force / motion direction |
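Merging a LoRA adapter into the base weights is why no separate adapter file is needed at inference. The sketch below is a minimal illustration for a single linear layer, assuming the standard LoRA parameterization W' = W + (alpha / r) B A. The rank, `alpha`, shapes, and the `merge_lora` / `forward_unmerged` helpers are illustrative assumptions and do not reflect PhyCo's actual configuration or checkpoint layout. It shows that the merged weight reproduces the base-plus-adapter output with one matmul:

```python
import numpy as np

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into a dense weight: W' = W + (alpha / rank) * B @ A.

    weight: (out_features, in_features) frozen base weight
    lora_a: (rank, in_features) down-projection
    lora_b: (out_features, rank) up-projection
    """
    scale = alpha / rank
    return weight + scale * (lora_b @ lora_a)

def forward_unmerged(x, weight, lora_a, lora_b, alpha, rank):
    # Base path plus the separate low-rank adapter path (training-time form).
    return x @ weight.T + (alpha / rank) * (x @ lora_a.T @ lora_b.T)

rng = np.random.default_rng(0)
out_f, in_f, r = 8, 16, 4  # illustrative sizes only
W = rng.standard_normal((out_f, in_f))
A = rng.standard_normal((r, in_f))
B = rng.standard_normal((out_f, r))
x = rng.standard_normal((3, in_f))

W_merged = merge_lora(W, A, B, alpha=8.0, rank=r)
y_merged = x @ W_merged.T  # single matmul with the merged weight
y_unmerged = forward_unmerged(x, W, A, B, alpha=8.0, rank=r)
print(np.allclose(y_merged, y_unmerged))
```

Because the low-rank update is folded into `W`, the merged layer has the same shape and cost as the original one, which is consistent with shipping everything as one `.pt` file.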
Given a single input frame and pixel-aligned physical-property maps, PhyCo generates a short video whose motion follows the specified physics. No simulator or geometry reconstruction is used at inference.
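This card does not specify how the physical-property maps are encoded. As a hedged sketch, assume each map is a single-channel float image at the frame's resolution, where every object's pixels hold its normalized property value and background pixels are zero. Under that assumption, a friction map could be rasterized from per-object masks as shown below. The `make_property_map` helper, the min-max normalization to [0, 1], and the overwrite rule for overlapping masks are all hypothetical choices:

```python
import numpy as np

def make_property_map(height, width, objects, vmin, vmax):
    """Rasterize per-object scalar properties into a pixel-aligned map.

    objects: list of (mask, value) pairs, where mask is a (height, width)
             boolean array and value is the raw physical property.
    Values are min-max normalized to [0, 1]; background pixels stay 0.
    Later objects overwrite earlier ones where masks overlap.
    """
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    prop_map = np.zeros((height, width), dtype=np.float32)
    for mask, value in objects:
        if mask.shape != (height, width):
            raise ValueError("mask must match the frame resolution")
        normalized = (value - vmin) / (vmax - vmin)
        prop_map[mask] = np.clip(normalized, 0.0, 1.0)
    return prop_map

# Example: a 4x6 frame with one "block" whose friction coefficient is 0.3,
# assuming friction values lie in [0.0, 1.0].
H, W = 4, 6
block = np.zeros((H, W), dtype=bool)
block[1:3, 2:5] = True  # 2 rows x 3 columns = 6 object pixels
friction_map = make_property_map(H, W, [(block, 0.3)], vmin=0.0, vmax=1.0)
print(friction_map)
```

Keeping the map at the frame's resolution is what makes it "pixel-aligned": each property value sits on exactly the pixels of the object it describes, so a ControlNet branch can read it alongside the input frame.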
Usage
This checkpoint is consumed by the PhyCo video-generation / evaluation code (see the project page for the code release). This repo contains:
- `phyco.pt`: the model checkpoint (base DiT + 3 ControlNet branches).
- `physics_iq_segmentation_pkl/`: per-scenario segmentation masks (198 `.pkl` files) used by the Physics-IQ benchmark JSON (`physiq_controlnet_v1.json`).
Download both and run the canonical generation command:
```bash
# checkpoint -> checkpoints/phyco/phyco.pt
hf download nnsriram97/phyco phyco.pt --local-dir checkpoints/phyco
# Physics-IQ segmentation masks -> assets/physics_iq_segmentation_pkl/
hf download nnsriram97/phyco --include "physics_iq_segmentation_pkl/*" --local-dir assets
# generate (single-file checkpoint; ControlNet branches are inside phyco.pt)
PYTHONPATH=$(pwd) bash scripts/launch/run_physprop_kubric_mpgu.sh \
  --batch_input_json scripts/batch_jsons/benchmark/physiq_controlnet_v1.json \
  --dit_path checkpoints/phyco/phyco.pt \
  --controlnet_branch_names controlnet_1,controlnet_2,controlnet_3 \
  --active_controlnets controlnet_1,controlnet_2,controlnet_3 \
  --controlnet_branch_scales "1.0,1.0,1.0" --dynamic_controlnet \
  --pipeline_config controlnet_multi_24fps_57frames \
  --conditioning_type image_blob --blob_type circle \
  --resolution 480 --fps 24 --total_seconds 5.0 \
  --physprop_type friction_bounciness_deformable_force_move_dir \
  --seed 42 --post_dit_subfolder physics-IQ
```
`--dynamic_controlnet` selects the relevant branch per scenario (friction/restitution → `controlnet_1`, deformability → `controlnet_2`, force → `controlnet_3`).
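The per-scenario routing described above can be pictured as a lookup from property type to branch, followed by a weighted sum of the active branches' residual features. This is a hypothetical illustration, not the PhyCo code. The property keys and the `parse_scales`, `select_branches`, and `combine_residuals` helpers are made up for the sketch. Summing branch outputs weighted by `--controlnet_branch_scales` is an assumption based on the usual multi-ControlNet convention. Only the property-to-branch mapping comes from this card:

```python
import numpy as np

# Property -> branch mapping as listed on this card.
BRANCH_FOR_PROPERTY = {
    "friction": "controlnet_1",
    "restitution": "controlnet_1",
    "deformability": "controlnet_2",
    "force": "controlnet_3",
}

def parse_scales(branch_names, scales_str):
    """Pair comma-separated branch names with comma-separated scales."""
    names = branch_names.split(",")
    scales = [float(s) for s in scales_str.split(",")]
    if len(names) != len(scales):
        raise ValueError("need one scale per branch")
    return dict(zip(names, scales))

def select_branches(scenario_properties):
    """Return the sorted, de-duplicated branches a scenario needs."""
    return sorted({BRANCH_FOR_PROPERTY[p] for p in scenario_properties})

def combine_residuals(residuals, active, scales):
    """Weighted sum of the active branches' residuals (None if none active)."""
    total = None
    for name in active:
        contribution = scales[name] * residuals[name]
        total = contribution if total is None else total + contribution
    return total

scales = parse_scales("controlnet_1,controlnet_2,controlnet_3", "1.0,1.0,1.0")
active = select_branches(["friction", "restitution"])
# Dummy constant residuals standing in for each branch's output features:
# controlnet_1 -> 1.0, controlnet_2 -> 2.0, controlnet_3 -> 3.0.
residuals = {name: np.full((2, 2), i + 1.0) for i, name in enumerate(scales)}
print(active, combine_residuals(residuals, active, scales))
```

Routing by property keeps unrelated branches from injecting conditioning signal into a scenario they were not trained for, while the scales leave room to strengthen or weaken a branch's influence without retraining.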
Intended use & limitations
- Intended use: research on physically grounded, controllable video generation; reproducing the paper's Physics-IQ results.
- Non-commercial: released under CC BY-NC-ND 4.0 (see License).
- Limitations: trained on Kubric-style simulation scenarios (block sliding, ball bouncing, etc.); behavior on out-of-distribution scenes is not guaranteed. The model inherits the capabilities and biases of its Cosmos-Predict2 base.
License
This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license. It is built on top of NVIDIA's Cosmos-Predict2, whose code/weights remain subject to their original Apache-2.0 / NVIDIA Open Model License terms.
Citation
```bibtex
@InProceedings{Narayanan_2026_CVPR,
  author    = {Narayanan, Sriram and Jiang, Ziyu and Narasimhan, Srinivasa and Chandraker, Manmohan},
  title     = {PhyCo: Learning Controllable Physical Priors for Generative Motion},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {41892-41902}
}
```