deep-plantain

LoRA on FLUX.2 Klein 4B for monocular metric depth, reproducing the depth task from Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329) on an open backbone.

The paper's claim

Vision Banana instruction-tunes Nano Banana Pro on five vision tasks (referring, semantic, and instance segmentation; metric depth; surface normals) by reframing each task's output as a decodable RGB visualization. It beats SAM 3, Depth Anything 3, and Lotus-2 while preserving the base model's generation quality.

What this LoRA tests

One axis of the claim:

  • One task of the five (monocular depth)
  • Open base (FLUX.2 Klein 4B)
  • LoRA fine-tuning rather than full instruction-tuning on the original training mixture

Method

Barron (2025) power transform (Ξ»=βˆ’3, c=10/3) maps metric depth to u ∈ [0, 1); u is piecewise-linearly interpolated along a 7-segment Hamiltonian path through the RGB cube corners (black β†’ blue β†’ cyan β†’ green β†’ yellow β†’ red β†’ magenta β†’ white). The decoder projects predicted RGB onto the nearest cube edge and inverts the transform. Decoder: decode_rgb_to_depth.py.
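The forward encoding can be sketched as below. The 7-segment piecewise-linear colormap follows the path described above; the Box-Cox-style form of `power_transform` is an illustrative stand-in (the exact parameterization is the one defined in Barron, 2025), so treat the constants as assumptions rather than the repository's exact mapping:

```python
import numpy as np

# Corners of the RGB cube along the 7-segment Hamiltonian path:
# black -> blue -> cyan -> green -> yellow -> red -> magenta -> white.
PATH = np.array([
    [0, 0, 0],  # black
    [0, 0, 1],  # blue
    [0, 1, 1],  # cyan
    [0, 1, 0],  # green
    [1, 1, 0],  # yellow
    [1, 0, 0],  # red
    [1, 0, 1],  # magenta
    [1, 1, 1],  # white
], dtype=np.float64)

def power_transform(depth_m, lam=-3.0, c=10.0 / 3.0):
    """Map metric depth [0, inf) to u in [0, 1), monotonically.

    Illustrative Box-Cox-style stand-in for the Barron (2025) power
    transform; the exact parameterization lives in the paper.
    """
    return 1.0 - (1.0 + np.asarray(depth_m, dtype=np.float64) / c) ** lam

def u_to_rgb(u):
    """Piecewise-linear interpolation of u in [0, 1) along the path."""
    u = np.clip(np.asarray(u, dtype=np.float64), 0.0, 1.0 - 1e-9)
    t = u * 7.0                        # position along the 7 segments
    seg = np.floor(t).astype(int)      # which segment (0..6)
    frac = (t - seg)[..., None]        # fraction within that segment
    return (1.0 - frac) * PATH[seg] + frac * PATH[seg + 1]
```

Near depths land on the black-to-blue-to-cyan segments and far depths drift toward white, which matches the color ordering used in the demo captions.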

Demos

cat

Indoor, in-distribution. Depth ordering correct: cat ~1–2 m (cyan), wall ~3 m (green), blanket nearer (deep blue).

beach

Outdoor OOD. Sky pins to ~5 m yellow rather than infinity. Salient subjects (figures, kite, bucket) still segment cleanly from the gradient.

skier

Outdoor OOD. Subject isolated from snow/mountain/sky; background ordering roughly right, compressed to the 15 m training cap.

Training

  • Base model: black-forest-labs/FLUX.2-klein-base-4B (4.0 B params; text encoder Qwen3-4B; VAE AutoencoderKLFlux2)
  • Adapter: LoRA, rank 32, alpha 32
  • Target modules: to_k, to_q, to_v, to_out.0 in the joint attention blocks, plus to_qkv_mlp_proj and attn.to_out in all 24 single transformer blocks; text encoder and VAE frozen
  • Resolution: 768 × 768
  • Batch size: 2 (no gradient accumulation)
  • Optimizer: AdamW, β = (0.9, 0.999), weight decay 1e-4
  • Learning rate: 1e-4, cosine schedule, 150-step warmup
  • Steps: 4 000 (snapshot of an in-progress 5 000-step run)
  • Samples seen: ~8 000
  • Mixed precision: bf16
  • Training data: Hypersim train (10 582 photorealistic synthetic indoor frames) + NYU Depth V2 train subset (1 500 real Kinect indoor frames) = 12 082 frames
  • Depth encoding: Barron (2025) power transform (λ = −3, c = 10/3), capped at 15 m, then Hamiltonian-path interpolation across the RGB cube
  • Hardware: single NVIDIA RTX 6000 Ada Generation (46 GB VRAM)
  • Wall time: ~5 hours
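Assuming a peft-style trainer, the adapter spec above could be written roughly as follows; module names follow diffusers conventions for the FLUX transformer, and this is a sketch, not the exact training configuration:

```python
from peft import LoraConfig

# Sketch of the adapter spec in the table above; assumes peft-style
# LoRA injection and diffusers-style module names.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "to_k", "to_q", "to_v", "to_out.0",  # joint-block attention
        "to_qkv_mlp_proj", "attn.to_out",    # single transformer blocks
    ],
)
```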

Status

A substantially better checkpoint is staged in pending/: rank-256 with a text-encoder LoRA, trained on 58 k mixed Hypersim + NYU frames. On the NYU Eigen test split (490 frames evaluated) it scores 0.596 m RMSE / 0.745 δ1 / 0.163 AbsRel, roughly doubling δ1 and more than halving RMSE versus the rank-32 baseline (1.566 m / 0.370 / 0.461). On the 10 hardest NYU frames, where the rank-32 baseline scored 3–4 m RMSE, the new checkpoint reaches 0.436 m / 0.819 δ1. For reference, the Vision Banana paper (full test set, full instruction-tune of NBP) reports 0.948 δ1 / 0.074 AbsRel. See pending/README.md for load instructions; the canonical root weights will be replaced once a later step beats this one.
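The reported numbers are the standard monocular-depth metrics. A minimal sketch, assuming aligned prediction and ground-truth depth maps in metres:

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """RMSE (metres), AbsRel, and delta1 over valid (gt > 0) pixels.

    delta1 is the fraction of pixels whose prediction/ground-truth
    ratio (either direction) is below 1.25.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > eps
    p, g = pred[valid], gt[valid]
    rmse = np.sqrt(np.mean((p - g) ** 2))
    absrel = np.mean(np.abs(p - g) / g)
    ratio = np.maximum(p / np.maximum(g, eps), g / np.maximum(p, eps))
    delta1 = np.mean(ratio < 1.25)
    return {"rmse": rmse, "absrel": absrel, "delta1": delta1}
```

A perfect prediction gives 0 RMSE, 0 AbsRel, and δ1 = 1.0; a uniform 30 % overestimate gives AbsRel = 0.3 and δ1 = 0.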

Usage

from diffusers import Flux2KleinPipeline
from diffusers.utils import load_image
import torch

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B",
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/deep-plantain")

src = load_image("input.jpg")  # source RGB image to estimate depth for

prompt = (
    "Generate a metric depth visualization of this image. Color scheme: "
    "0 m black, ~0.8 m blue, ~1.8 m cyan, ~3.2 m green, ~5.3 m yellow, "
    "~8.7 m red, ~16.5 m magenta, far approaching white."
)

depth_pil = pipe(image=src, prompt=prompt, num_inference_steps=20).images[0]
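decode_rgb_to_depth.py is the canonical decoder. A minimal stand-in for the nearest-edge projection and inverse transform might look like this; the inverse uses an illustrative Box-Cox-style assumption for the power transform, not the exact Barron (2025) form:

```python
import numpy as np

# RGB-cube corners along the Hamiltonian path used by the colormap.
PATH = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
                 [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1]],
                dtype=np.float64)

def rgb_to_depth(rgb, lam=-3.0, c=10.0 / 3.0, cap=15.0):
    """Project each predicted RGB pixel onto the nearest path segment,
    recover u in [0, 1), and invert an illustrative Box-Cox-style
    stand-in for the power transform. Depth is capped at `cap` metres."""
    rgb = np.asarray(rgb, dtype=np.float64).reshape(-1, 3)
    a, b = PATH[:-1], PATH[1:]                       # segment endpoints
    ab = b - a
    # Parametric projection of every pixel onto every segment.
    t = ((rgb[:, None, :] - a) * ab).sum(-1) / (ab * ab).sum(-1)
    t = np.clip(t, 0.0, 1.0)
    proj = a + t[..., None] * ab                     # (N, 7, 3)
    d2 = ((rgb[:, None, :] - proj) ** 2).sum(-1)
    seg = d2.argmin(1)                               # nearest segment
    u = (seg + t[np.arange(len(rgb)), seg]) / 7.0
    u = np.clip(u, 0.0, 1.0 - 1e-9)
    depth = c * ((1.0 - u) ** (1.0 / lam) - 1.0)     # invert the transform
    return np.minimum(depth, cap)
```

Projecting onto the path before inverting makes the decoder robust to off-manifold colors in the generated visualization, which is the behavior the Method section describes.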

References

  • Gabeur, Long, Peng, et al. Image Generators are Generalist Vision Learners. arXiv:2604.20329 (2026).
  • Barron, J. T. A Power Transform. arXiv:2502.10647 (2025).

License

The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.

Training data attributions

  • Hypersim (Roberts et al., 2021). Photorealistic synthetic indoor frames used as a portion of the training data. Licensed under the Apple ML Research License Agreement; non-commercial research use only. The 3D scenes underlying the rendered images were originally licensed from Evermotion. See https://github.com/apple/ml-hypersim for full terms.
  • NYU Depth V2 (Silberman et al., 2012). Real Kinect indoor frames used as a portion of the training data. Released for research use; see https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html.

Use of the LoRA adapter for purposes that would conflict with the source datasets' licenses (e.g., redistributing reconstructed Hypersim imagery commercially) is not authorised by the data holders. The adapter itself does not embed the source images, but downstream use that effectively reconstructs them inherits those constraints.

Base model

Base model FLUX.2 Klein 4B is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.
