Hybrid ACT+Diffusion - ALOHA Single-Arm (Left) - 40k steps
Custom HybridACTDiffusion policy: an ACT visual encoder (ResNet18 + 4-layer Transformer, mean-pooled) feeds a Diffusion U-Net decoder (FiLM conditioning, DDPM training, DDIM 10-step inference). There is no VAE; the diffusion decoder handles multimodal action distributions directly.
This is the 40k-step retrain (workstream S004) matching S003's step count for direct architectural comparison vs the shipped ACT-40k baseline. For the initial 13.4k baseline, see JHeisler/aloha_solo_left_act_diffusion.
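The conditioning path described above (mean-pool the encoder tokens into one vector, then use it to modulate decoder features via FiLM) can be illustrated with a minimal sketch. This is not the repository's implementation: it uses plain Python lists instead of tensors to stay dependency-free, the helper names `mean_pool`, `linear`, and `film` are hypothetical, and it shows the generic FiLM form with both scale and shift, which may differ from what the actual U-Net blocks apply.

```python
def mean_pool(tokens):
    """Average a sequence of token vectors (T x D) into a single D-dim vector."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / n for d in range(dim)]


def linear(x, weight, bias):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(features, cond, w_scale, b_scale, w_shift, b_shift):
    """Generic FiLM: per-channel scale and shift predicted from a conditioning vector.

    features: C x L list (channels x positions along the action chunk)
    cond:     D-dim global conditioning vector
    """
    scale = linear(cond, w_scale, b_scale)  # one scale per channel
    shift = linear(cond, w_shift, b_shift)  # one shift per channel
    return [[scale[c] * v + shift[c] for v in row] for c, row in enumerate(features)]


# Toy sizes for illustration only (the model card states a 512-dim cond vector).
tokens = [[1.0, 2.0], [3.0, 4.0]]   # 2 encoder tokens, 2 dims each
cond = mean_pool(tokens)             # global conditioning vector
features = [[1.0, 2.0, 3.0]]         # 1 channel, 3 positions
modulated = film(
    features, cond,
    w_scale=[[1.0, 0.0]], b_scale=[0.0],
    w_shift=[[0.0, 1.0]], b_shift=[0.0],
)
print(cond, modulated)
```

Because the pooled vector is global, every position in the action chunk receives the same per-channel modulation; only the channel index changes the scale and shift.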
Architecture
Images (cam_high, cam_left_wrist) + State (dim=9)
│
▼
ACT Encoder (ResNet18 → 4-layer Transformer) → mean-pool → (B, 512) global cond vector
│
▼
Diffusion U-Net (DiffusionConditionalUnet1d, FiLM modulation, down_dims=(256,512))
│ DDPM training (100 timesteps) / DDIM 10-step inference
▼
Action chunks (chunk_size=100, action_dim=9)
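To make the "100 training timesteps, 10 DDIM inference steps" split concrete, the sketch below builds a squared-cosine beta schedule and picks an evenly strided subset of timesteps for inference. It is a rough illustration under assumptions: the cosine formula follows the commonly published form that `squaredcos_cap_v2` is generally understood to refer to, the evenly strided selection is just one simple spacing convention, and the function names are hypothetical. The actual scheduler used by the policy may differ in details.

```python
import math


def cosine_betas(num_train_timesteps=100, max_beta=0.999):
    """Squared-cosine beta schedule, with each beta capped at max_beta."""
    def alpha_bar(u):
        return math.cos((u + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_train_timesteps):
        t1 = i / num_train_timesteps
        t2 = (i + 1) / num_train_timesteps
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


def alphas_cumprod(betas):
    """Running product of (1 - beta_t): the signal fraction kept at each timestep."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def ddim_timesteps(num_train_timesteps=100, num_inference_steps=10):
    """Evenly strided subset of training timesteps, visited from noisy to clean."""
    stride = num_train_timesteps // num_inference_steps
    return [i * stride for i in range(num_inference_steps)][::-1]


def add_noise(x0, noise, t, acp):
    """DDPM forward process: x_t = sqrt(acp_t) * x0 + sqrt(1 - acp_t) * noise."""
    a, s = math.sqrt(acp[t]), math.sqrt(1.0 - acp[t])
    return [a * x + s * n for x, n in zip(x0, noise)]


acp = alphas_cumprod(cosine_betas())
print(ddim_timesteps())
```

During training the network sees noised action chunks at any of the 100 timesteps and learns to predict the noise; at inference only the strided subset is visited, which is where the 10x reduction in denoising passes comes from.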
Training Config
| Field | Value |
|---|---|
| Architecture | HybridACTDiffusion (ACT encoder + Diffusion U-Net); see lerobot/common/policies/hybrid_act_diffusion/ |
| Dataset | JHeisler/aloha_solo_left_4_6_26: 50 episodes, 29,785 samples, 30 fps |
| State / action dim | 9 / 9 |
| Cameras | cam_high, cam_left_wrist (3×480×640 each) |
| Steps | 40,000 |
| Batch size | 28 (adaptive DOE winner; beats bs=24 by 6.8% throughput at 91.3 samples/s) |
| Learning rate | 3.5e-5 (linear-scaled from bs=24's 3e-5) |
| Total samples seen | 1.12M (40,000 steps × batch size 28) |
| AMP | enabled |
| torch.compile | enabled |
| Save freq | every 10,000 steps (10k / 20k / 30k / 40k checkpoints) |
| Diffusion scheduler | DDPM training (100 timesteps, squaredcos_cap_v2), DDIM at inference (10 steps) |
| Final loss (DDPM noise-pred MSE) | 0.003–0.007 |
| Final grad norm | ~0.10–0.18 |
| Wall clock | ~3h 53min on RTX A4500 |
| LeRobot pin | 96c7052777aca85d4e55dfba8f81586103ba8f61 (with custom hybrid_act_diffusion policy added) |
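The derived quantities in the table follow from simple arithmetic on the listed steps, batch size, and dataset size. The sketch below reproduces them; the helper name `linear_scaled_lr` is hypothetical, and the epoch count is a derived estimate rather than a reported figure.

```python
def linear_scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * new_batch / base_batch


steps = 40_000
batch_size = 28
dataset_samples = 29_785  # frames in the 50-episode dataset

samples_seen = steps * batch_size          # total training samples drawn
epochs = samples_seen / dataset_samples    # approximate passes over the dataset
lr = linear_scaled_lr(3e-5, base_batch=24, new_batch=28)

print(f"samples seen: {samples_seen:,}")
print(f"approx. epochs: {epochs:.1f}")
print(f"scaled learning rate: {lr:.2e}")
```

The same steps-times-batch-size product gives the 1.12M figure listed for S004 in the lineage table.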
Project Lineage
| Workstream | Model | Steps | Samples | HF |
|---|---|---|---|---|
| S001 | ACT | 13,400 | 640K | act_left |
| S002 | Hybrid ACT+Diffusion | 13,400 | 321K | act_diffusion |
| S003 | ACT (shipped) | 40,000 | 1.92M | act_left_40k |
| S004 | Hybrid ACT+Diffusion | 40,000 | 1.12M | this repo |
S003 vs S004 is the apples-to-apples architectural comparison: same dataset, same step count, ACT-VAE vs ACT-Diffusion decoder.
Notes on loss comparability
DDPM noise-prediction MSE (this model) and ACT's combined L1+KL loss (S001/S003) are different loss surfaces, so absolute loss values are NOT directly comparable across architectures. The right comparison is offline action L1 on held-out episodes or real-robot rollout success rate.
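The two architecture-agnostic metrics named above can be sketched as follows. This is an illustrative outline, not the project's evaluation code: `action_l1` and `success_rate` are hypothetical helpers, and action chunks are represented as nested Python lists (chunks x time steps x action dims) rather than tensors.

```python
def action_l1(pred_chunks, true_chunks):
    """Mean absolute error over all chunks, time steps, and action dimensions."""
    total, count = 0.0, 0
    for pred, true in zip(pred_chunks, true_chunks):
        for p_step, t_step in zip(pred, true):
            for p, t in zip(p_step, t_step):
                total += abs(p - t)
                count += 1
    if count == 0:
        raise ValueError("no actions to compare")
    return total / count


def success_rate(outcomes):
    """Fraction of rollouts marked successful (True)."""
    if not outcomes:
        raise ValueError("no rollouts recorded")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Toy example: one chunk, two time steps, two action dims.
pred = [[[0.0, 1.0], [2.0, 2.0]]]
true = [[[1.0, 1.0], [2.0, 4.0]]]
print(action_l1(pred, true))
print(success_rate([True, False, True, True]))
```

Both policies would be scored on the same held-out episodes with the same function, so the numbers are comparable even though their training losses are not.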
Usage
# Requires lerobot pinned to 96c7052 with hybrid_act_diffusion policy package added
from lerobot.common.policies.hybrid_act_diffusion.modeling_hybrid_act_diffusion import HybridACTDiffusionPolicy
policy = HybridACTDiffusionPolicy.from_pretrained("JHeisler/aloha_solo_left_act_diffusion_40k")
Citation / Course
EN.525.681 school project β JHU Whiting School of Engineering. Team: Jake Heisler, Laura Kroening, Purushottam Shukla.
Code reference: HuggingFace LeRobot at commit 96c7052 with custom hybrid policy package.