deep-plantain
LoRA on FLUX.2 Klein 4B for monocular metric depth, reproducing the depth task from Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329) on an open backbone.
The paper's claim
Vision Banana instruction-tunes Nano Banana Pro on five vision tasks (referring, semantic, and instance segmentation; metric depth; surface normals) by reframing outputs as decodable RGB visualizations. It beats SAM 3 / Depth Anything 3 / Lotus-2 while preserving the base model's generation quality.
What this LoRA tests
One axis of the claim:
- One task of the five (monocular depth)
- Open base (FLUX.2 Klein 4B)
- LoRA, not full instruction-tuning of the original training mixture
Method
Barron (2025) power transform (λ = −3, c = 10/3) maps metric depth to u ∈ [0, 1); u is piecewise-linearly interpolated along a 7-segment Hamiltonian path through the RGB cube corners (black → blue → cyan → green → yellow → red → magenta → white). The decoder projects predicted RGB onto the nearest cube edge and inverts the transform. Decoder: decode_rgb_to_depth.py.
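The colormap itself is fully determined by the description above, and the nearest-edge projection can be sketched in a few lines. This is an illustrative sketch only: it covers the u → RGB interpolation and the geometric edge projection, while the power-transform inversion (u → metric depth) is left to the repo's decode_rgb_to_depth.py, which remains authoritative.

```python
import numpy as np

# Corners of the 7-segment Hamiltonian path through the RGB cube:
# black -> blue -> cyan -> green -> yellow -> red -> magenta -> white.
PATH = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
    [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1],
], dtype=float)

def u_to_rgb(u):
    """Piecewise-linear interpolation of u in [0, 1) along the path."""
    t = np.clip(u, 0.0, 1.0 - 1e-9) * 7.0
    i = int(t)
    f = t - i
    return (1 - f) * PATH[i] + f * PATH[i + 1]

def rgb_to_u(rgb):
    """Project a (possibly noisy) RGB triple onto the nearest of the
    seven path segments and recover u; mirrors the edge-projection
    step of the decoder geometrically."""
    rgb = np.asarray(rgb, dtype=float)
    best_d2, best_u = np.inf, 0.0
    for i in range(7):
        a, b = PATH[i], PATH[i + 1]
        f = np.clip(np.dot(rgb - a, b - a) / np.dot(b - a, b - a), 0.0, 1.0)
        p = a + f * (b - a)
        d2 = float(np.sum((rgb - p) ** 2))
        if d2 < best_d2:
            best_d2, best_u = d2, (i + f) / 7.0
    return best_u
```

Round-tripping u through the colormap and the projection recovers it exactly for clean colors; for noisy model outputs the projection snaps to the closest point on the path before inverting.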
Demos
Indoor, in-distribution. Depth ordering correct: cat ~1–2 m (cyan), wall ~3 m (green), blanket nearer (deep blue).
Outdoor OOD. Sky pins to ~5 m yellow rather than infinity. Salient subjects (figures, kite, bucket) still segment cleanly from the gradient.
Outdoor OOD. Subject isolated from snow/mountain/sky; background ordering roughly right, compressed to 15 m.
Training
| Setting | Value |
|---|---|
| Base | black-forest-labs/FLUX.2-klein-base-4B (4.0 B params, text encoder Qwen3-4B, VAE AutoencoderKLFlux2) |
| Adapter | LoRA, rank 32, alpha 32 |
| Target modules | Transformer attention to_k, to_q, to_v, to_out.0 (joint blocks) + to_qkv_mlp_proj and attn.to_out of all 24 single transformer blocks. Text encoder and VAE frozen. |
| Resolution | 768 × 768 |
| Batch size | 2 (no gradient accumulation) |
| Optimizer | AdamW, β = (0.9, 0.999), weight decay 1e-4 |
| Learning rate | 1e-4, cosine schedule, 150-step warmup |
| Steps | 4 000 (snapshot of an in-progress 5 000-step run) |
| Samples seen | ~8 000 |
| Mixed precision | bf16 |
| Training data | Hypersim train (10 582 frames, photorealistic synthetic indoor) + NYU Depth V2 train subset (1 500 frames, real Kinect indoor) = 12 082 frames |
| Depth encoding | Barron 2025 power transform (λ = −3, c = 10/3), capped at 15 m, then Hamiltonian-path interpolation across the RGB cube |
| Hardware | Single NVIDIA RTX 6000 Ada Generation (46 GB VRAM) |
| Wall time | ~5 hours |
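The learning-rate schedule from the table (1e-4 with linear warmup for 150 steps, then cosine decay over the 4 000-step budget) can be written out explicitly. This is a sketch of the schedule only; the training script itself is not part of this repo.

```python
import math

def lr_at(step, base_lr=1e-4, warmup=150, total=4000):
    """Linear warmup for the first `warmup` steps, then cosine decay
    to zero at `total` steps (sketch of the schedule in the table)."""
    if step < warmup:
        return base_lr * step / warmup
    t = (step - warmup) / (total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```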
Status
A substantially better checkpoint is staged in pending/: rank-256 + text-encoder LoRA trained on 58 k mixed Hypersim + NYU frames. NYU Eigen test (490 frames evaluated): 0.596 m RMSE / 0.745 δ1 / 0.163 AbsRel, roughly doubling δ1 and more than halving RMSE versus the rank-32 baseline (1.566 m / 0.370 / 0.461). On the 10 hardest NYU frames, where the rank-32 baseline scored 3–4 m, the new checkpoint gets 0.436 m / 0.819 δ1. Vision Banana paper reference (full set, full instruction-tune of NBP): 0.948 δ1 / 0.074 AbsRel. See pending/README.md for load instructions; canonical root weights will be replaced once a later step beats this one.
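For reference, the three numbers quoted above follow the standard Eigen-protocol definitions: RMSE in metres, absolute relative error, and δ1 (fraction of pixels whose pred/gt ratio is within 1.25 in either direction). A minimal sketch, assuming the usual mask of invalid (non-positive) ground-truth pixels; the exact evaluation script used here is not part of this repo.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Return (RMSE in metres, AbsRel, delta1) over valid pixels,
    where delta1 counts max(pred/gt, gt/pred) < 1.25."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    mask = gt > 0                      # mask out invalid ground truth
    p, g = pred[mask], gt[mask]
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    absrel = float(np.mean(np.abs(p - g) / g))
    delta1 = float(np.mean(np.maximum(p / g, g / p) < 1.25))
    return rmse, absrel, delta1
```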
Usage
```python
import torch
from diffusers import Flux2KleinPipeline
from PIL import Image

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B",
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/deep-plantain")

src = Image.open("input.jpg").convert("RGB")  # any RGB source image
prompt = (
    "Generate a metric depth visualization of this image. Color scheme: "
    "0 m black, ~0.8 m blue, ~1.8 m cyan, ~3.2 m green, ~5.3 m yellow, "
    "~8.7 m red, ~16.5 m magenta, far approaching white."
)
depth_pil = pipe(image=src, prompt=prompt, num_inference_steps=20).images[0]
```
References
- Gabeur, Long, Peng, et al. Image Generators are Generalist Vision Learners. arXiv:2604.20329 (2026).
- Barron, J. T. A Power Transform. arXiv:2502.10647 (2025).
License
The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.
Training data attributions
- Hypersim (Roberts et al., 2021). Photorealistic synthetic indoor frames used as a portion of the training data. Licensed under the Apple ML Research License Agreement; non-commercial research use only. The 3D scenes underlying the rendered images were originally licensed from Evermotion. See https://github.com/apple/ml-hypersim for full terms.
- NYU Depth V2 (Silberman et al., 2012). Real Kinect indoor frames used as a portion of the training data. Released for research use; see https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html.
Use of the LoRA adapter for purposes that would conflict with the source datasets' licenses (e.g., redistributing reconstructed Hypersim imagery commercially) is not authorized by the data holders. The adapter itself does not embed the source images, but downstream use that effectively reconstructs them inherits those constraints.
Base model
Base model FLUX.2 Klein 4B is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.