deep-plantain
LoRA on FLUX.2 Klein 4B for monocular metric depth, reproducing the depth task from Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329) on an open backbone.
The paper's claim
Vision Banana instruction-tunes Nano Banana Pro on five vision tasks (referring, semantic, and instance segmentation; metric depth; surface normals) by reframing outputs as decodable RGB visualizations. It beats SAM 3 / Depth Anything 3 / Lotus-2 while preserving the base model's generation quality.
What this LoRA tests
One axis of the claim:
- One task of the five (monocular depth)
- Open base (FLUX.2 Klein 4B)
- LoRA, not full instruction-tuning of the original training mixture
Method
Barron (2025) power transform (λ = −3, c = 10/3) maps metric depth to u ∈ [0, 1); u is piecewise-linearly interpolated along a 7-segment Hamiltonian path through the RGB cube corners (black → blue → cyan → green → yellow → red → magenta → white). The decoder projects predicted RGB onto the nearest cube edge and inverts the transform. Decoder: decode_rgb_to_depth.py.
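The colormap itself is fully determined by the description above, and the nearest-edge projection can be sketched in a few lines. This is an illustrative sketch only: it covers the u → RGB interpolation and the geometric edge projection, while the power-transform inversion (u → metric depth) is left to the repo's decode_rgb_to_depth.py, which remains authoritative.

```python
import numpy as np

# Corners of the 7-segment Hamiltonian path through the RGB cube:
# black -> blue -> cyan -> green -> yellow -> red -> magenta -> white.
PATH = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
    [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1],
], dtype=float)

def u_to_rgb(u):
    """Piecewise-linear interpolation of u in [0, 1) along the path."""
    t = np.clip(u, 0.0, 1.0 - 1e-9) * 7.0
    i = int(t)
    f = t - i
    return (1 - f) * PATH[i] + f * PATH[i + 1]

def rgb_to_u(rgb):
    """Project a (possibly noisy) RGB triple onto the nearest of the
    seven path segments and recover u; mirrors the edge-projection
    step of the decoder geometrically."""
    rgb = np.asarray(rgb, dtype=float)
    best_d2, best_u = np.inf, 0.0
    for i in range(7):
        a, b = PATH[i], PATH[i + 1]
        f = np.clip(np.dot(rgb - a, b - a) / np.dot(b - a, b - a), 0.0, 1.0)
        p = a + f * (b - a)
        d2 = float(np.sum((rgb - p) ** 2))
        if d2 < best_d2:
            best_d2, best_u = d2, (i + f) / 7.0
    return best_u
```

Round-tripping u through the colormap and the projection recovers it exactly for clean colors; for noisy model outputs the projection snaps to the closest point on the path before inverting.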
Demos
Indoor, in-distribution. Depth ordering correct: cat ~1–2 m (cyan), wall ~3 m (green), blanket nearer (deep blue).
Outdoor OOD. Sky pins to ~5 m yellow rather than infinity. Salient subjects (figures, kite, bucket) still segment cleanly from the gradient.
Outdoor OOD. Subject isolated from snow/mountain/sky; background ordering roughly right, compressed to 15 m.
Training
| Setting | Value |
|---|---|
| Base | black-forest-labs/FLUX.2-klein-base-4B (4.0 B params, text encoder Qwen3-4B, VAE AutoencoderKLFlux2) |
| Adapter | LoRA, rank 32, alpha 32 |
| Target modules | Transformer attention to_k, to_q, to_v, to_out.0 (joint blocks) + to_qkv_mlp_proj and attn.to_out of all 24 single transformer blocks. Text encoder and VAE frozen. |
| Resolution | 768 × 768 |
| Batch size | 2 (no gradient accumulation) |
| Optimizer | AdamW, β = (0.9, 0.999), weight decay 1e-4 |
| Learning rate | 1e-4, cosine schedule, 150-step warmup |
| Steps | 4 000 (snapshot of an in-progress 5 000-step run) |
| Samples seen | ~8 000 |
| Mixed precision | bf16 |
| Training data | Hypersim train (10 582 frames, photorealistic synthetic indoor) + NYU Depth V2 train subset (1 500 frames, real Kinect indoor) = 12 082 frames |
| Depth encoding | Barron 2025 power transform (λ = −3, c = 10/3), capped at 15 m, then Hamiltonian-path interpolation across the RGB cube |
| Hardware | Single NVIDIA RTX 6000 Ada Generation (46 GB VRAM) |
| Wall time | ~5 hours |
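The learning-rate schedule from the table (1e-4 with linear warmup for 150 steps, then cosine decay over the 4 000-step budget) can be written out explicitly. This is a sketch of the schedule only; the training script itself is not part of this repo.

```python
import math

def lr_at(step, base_lr=1e-4, warmup=150, total=4000):
    """Linear warmup for the first `warmup` steps, then cosine decay
    to zero at `total` steps (sketch of the schedule in the table)."""
    if step < warmup:
        return base_lr * step / warmup
    t = (step - warmup) / (total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```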
Status
A substantially better checkpoint is staged in pending/: rank-256 + text-encoder LoRA trained on 58 k mixed Hypersim + NYU frames. NYU Eigen test (490 frames evaluated): 0.596 m RMSE / 0.745 δ1 / 0.163 AbsRel, roughly doubling δ1 and more than halving RMSE versus the rank-32 baseline (1.566 m / 0.370 / 0.461). On the 10 hardest NYU frames, where the rank-32 baseline scored 3–4 m, the new checkpoint gets 0.436 m / 0.819 δ1. Vision Banana paper reference (full set, full instruction-tune of NBP): 0.948 δ1 / 0.074 AbsRel. See pending/README.md for load instructions; canonical root weights will be replaced once a later step beats this one.
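For reference, the three numbers quoted above follow the standard Eigen-protocol definitions: RMSE in metres, absolute relative error, and δ1 (fraction of pixels whose pred/gt ratio is within 1.25 in either direction). A minimal sketch, assuming the usual mask of invalid (non-positive) ground-truth pixels; the exact evaluation script used here is not part of this repo.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Return (RMSE in metres, AbsRel, delta1) over valid pixels,
    where delta1 counts max(pred/gt, gt/pred) < 1.25."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    mask = gt > 0                      # mask out invalid ground truth
    p, g = pred[mask], gt[mask]
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    absrel = float(np.mean(np.abs(p - g) / g))
    delta1 = float(np.mean(np.maximum(p / g, g / p) < 1.25))
    return rmse, absrel, delta1
```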
Usage
```python
import torch
from diffusers import Flux2KleinPipeline
from PIL import Image

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B",
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/deep-plantain")

src = Image.open("input.jpg").convert("RGB")  # any RGB source image
prompt = (
    "Generate a metric depth visualization of this image. Color scheme: "
    "0 m black, ~0.8 m blue, ~1.8 m cyan, ~3.2 m green, ~5.3 m yellow, "
    "~8.7 m red, ~16.5 m magenta, far approaching white."
)
depth_pil = pipe(image=src, prompt=prompt, num_inference_steps=20).images[0]
```
References
- Gabeur, Long, Peng, et al. Image Generators are Generalist Vision Learners. arXiv:2604.20329 (2026).
- Barron, J. T. A Power Transform. arXiv:2502.10647 (2025).
License
The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.
Training data attributions
- Hypersim (Roberts et al., 2021). Photorealistic synthetic indoor frames used as a portion of the training data. Licensed under the Apple ML Research License Agreement; non-commercial research use only. The 3D scenes underlying the rendered images were originally licensed from Evermotion. See https://github.com/apple/ml-hypersim for full terms.
- NYU Depth V2 (Silberman et al., 2012). Real Kinect indoor frames used as a portion of the training data. Released for research use; see https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html.
Use of the LoRA adapter for purposes that would conflict with the source datasets' licenses (e.g., redistributing reconstructed Hypersim imagery commercially) is not authorized by the data holders. The adapter itself does not embed the source images, but downstream use that effectively reconstructs them inherits those constraints.
Base model
Base model FLUX.2 Klein 4B is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.