I built an interactive studio that turns any photo into a collectible 3D figurine. I never opened Photoshop. I never opened Blender. Not once during the build, and not once when a figurine is actually generated. Every image is produced by one Hugging Face Space and every 3D model by another, both driven by an agent reading their agents.md.
Here it is, live. Sign in and bring your own face:
This is the third post in a small series about treating multimedia models as callable building blocks. The first two built galleries. This one is different: the studio is a real app, and the end user runs the exact same pipeline the agent did, with their own prompt and their own photo.
A few years ago, making "a stylized 3D figurine of a person" meant a pipeline of heavy creative software. Retouch the photo in Photoshop. Sculpt or retopologize in Blender or ZBrush. Bake textures. Light a scene. Each step was its own application, its own file format, its own expertise.
The interesting shift is not that AI models can now do each of those steps. It is that you no longer integrate them as software at all. You call them. A state-of-the-art image model and a state-of-the-art image-to-3D model are each just an HTTP endpoint with a documented contract. The agent does not import a library or manage weights or provision a GPU. It reads how to call the thing, then calls it.
So the creative stack collapses from "install and learn N applications" down to "describe what you want, and let an agent glue together a couple of model endpoints."
agents.md is the contract
Every Gradio Space on the Hub exposes a plain-text agents.md that tells an agent exactly how to drive it:
curl https://huggingface.co/spaces/microsoft/TRELLIS.2/agents.md
returns the whole contract in one shot: the API schema, the call and poll endpoints, how to upload a file, and the auth hint.
API schema: GET .../gradio_api/info
Call endpoint: POST .../gradio_api/call/{endpoint} {"param_name": value, ...}
Poll result: GET .../gradio_api/call/{endpoint}/{event_id}
File inputs: POST .../gradio_api/upload -F "files=@file.ext"
Auth: Bearer $HF_TOKEN
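Read as code, those five lines come down to three request shapes and one header. Here is a minimal Python sketch that only builds the requests and sends nothing. The base URL, the endpoint name "generate", and the parameter names are hypothetical placeholders: the real ones come from the Space's own agents.md and its gradio_api/info schema. The response shapes are not part of the excerpt above, so the sketch does not guess at them.

```python
import json
from typing import Any, Dict


def _auth(token: str) -> Dict[str, str]:
    # "Auth: Bearer $HF_TOKEN" from the contract.
    return {"Authorization": f"Bearer {token}"}


def call_request(base_url: str, endpoint: str,
                 params: Dict[str, Any], token: str) -> Dict[str, Any]:
    """POST .../gradio_api/call/{endpoint} with named parameters as JSON."""
    return {
        "method": "POST",
        "url": f"{base_url.rstrip('/')}/gradio_api/call/{endpoint}",
        "headers": {**_auth(token), "Content-Type": "application/json"},
        "body": json.dumps(params),
    }


def poll_request(base_url: str, endpoint: str,
                 event_id: str, token: str) -> Dict[str, Any]:
    """GET .../gradio_api/call/{endpoint}/{event_id} to fetch the result."""
    return {
        "method": "GET",
        "url": f"{base_url.rstrip('/')}/gradio_api/call/{endpoint}/{event_id}",
        "headers": _auth(token),
    }


def upload_request(base_url: str, file_path: str, token: str) -> Dict[str, Any]:
    """POST .../gradio_api/upload as multipart form data, field name "files"."""
    return {
        "method": "POST",
        "url": f"{base_url.rstrip('/')}/gradio_api/upload",
        "headers": _auth(token),
        "multipart": {"files": file_path},
    }


# Hypothetical values, for illustration only.
base = "https://example-space.invalid"
req = call_request(base, "generate", {"prompt": "a vinyl figurine"}, "hf_xxx")
print(req["method"], req["url"])
```

Any HTTP client can send these descriptions as they are, which is why the contract fits in a handful of lines.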
No client library, no hardcoded integration. You can find this on any Gradio Space through its Agents button:
Point a coding agent at that link, give it an HF_TOKEN, and it can run the Space end to end. The payoff is chaining: the output of one Space becomes the input to the next.
The whole figurine pipeline is two endpoints in a row.
Photo to figurine portrait. black-forest-labs/FLUX.2-dev takes a reference photo plus a prompt and returns an identity-preserving "collectible vinyl figurine" portrait. This is the step that used to be Photoshop. It is now one call with a reference image attached.
Portrait to 3D. microsoft/TRELLIS.2 takes that single image and returns a real textured 3D mesh (a .glb). This is the step that used to be Blender. It is now one call.
Here is a real run. A reference photo on the left, the figurine the pipeline returns on the right. No manual editing in between.

Because the two steps run in sequence, the studio shows the intermediate result while you wait: your photo, then the figurine portrait the moment FLUX returns it, then the spinning 3D model once TRELLIS finishes.
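One way to get that staged display is to write the pipeline as a generator that yields each artifact as soon as it exists. The sketch below is an assumption about the shape, not the studio's actual code. The two steps are injected as plain callables, and fake_flux and fake_trellis are hypothetical stand-ins that only rename files so the example runs offline.

```python
from typing import Callable, Iterator, Tuple

# Each step is injected, so the same loop works with real Space calls or fakes.
PortraitStep = Callable[[str, str], str]  # (photo, prompt) -> portrait reference
MeshStep = Callable[[str], str]           # portrait reference -> .glb reference


def run_studio(photo: str, prompt: str,
               to_portrait: PortraitStep,
               to_mesh: MeshStep) -> Iterator[Tuple[str, str]]:
    """Yield (stage, artifact) pairs in the order the viewer can show them."""
    yield ("photo", photo)                 # available immediately
    portrait = to_portrait(photo, prompt)  # step 1: photo + prompt -> portrait
    yield ("portrait", portrait)           # shown while step 2 is still running
    mesh = to_mesh(portrait)               # step 2: portrait -> 3D mesh
    yield ("mesh", mesh)


# Hypothetical stand-ins for the two Spaces: they only derive file names.
def fake_flux(photo: str, prompt: str) -> str:
    return photo.rsplit(".", 1)[0] + "_figurine.png"


def fake_trellis(portrait: str) -> str:
    return portrait.rsplit(".", 1)[0] + ".glb"


for stage, artifact in run_studio("me.jpg", "collectible vinyl figurine",
                                  fake_flux, fake_trellis):
    print(stage, artifact)
```

Because the generator is lazy, the second step does not start until the caller has consumed the portrait, so a UI can render each stage as it arrives.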
To prove the style holds across very different subjects, the studio ships with a cabinet of famous artworks, each one a single prompt through the same two Spaces:

The Girl with a Pearl Earring, the Birth of Venus, American Gothic, and the Son of Man. Same two Spaces, only the prompt changed.
In the first two posts, the agent was the only one calling the Spaces, and the output was a finished gallery. This time the app hands the pipeline to the visitor.
When you sign in with Hugging Face and upload a photo, the studio forwards your image to FLUX.2 and TRELLIS using your token, so the GPU work runs on your own quota. The app itself is mostly glue: a small server that reads the same agents.md contracts, plus a viewer to spin the result. There is no creative software on the backend, because there is no creative software anywhere in the loop.
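The token forwarding is worth spelling out, because it decides whose quota pays for the GPU time. Here is a minimal sketch, assuming the server keeps the visitor's token in a per-session dict under a hypothetical "hf_token" key. The real app may store and pass it differently.

```python
from typing import Dict, Optional


class NotSignedIn(Exception):
    """Raised when the session holds no Hugging Face token."""


def headers_for_visitor(session: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Build the auth header from the visitor's own token.

    There is deliberately no fallback to a server-side token: if the
    visitor is not signed in, the request is refused instead of being
    billed to the app's quota.
    """
    token = session.get("hf_token")  # hypothetical session key
    if not token:
        raise NotSignedIn("sign in with Hugging Face before generating")
    return {"Authorization": f"Bearer {token}"}
```

The same header then goes on both calls, FLUX.2 first and TRELLIS second.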
That is the part I find most telling. The "studio" is not a 3D tool that happens to use AI. It is a thin wrapper around two model endpoints, and the wrapper is the only thing that needed to be written.
agents.md makes a Space trivial to drive, so an agent reaches for it instead of a model it would have to set up by hand. The future of a lot of multimedia software looks less like learning Photoshop and Blender, and more like connecting your agent to a prompt, and connecting that to the right image and 3D generation Spaces on the Hub.
# photo + prompt -> figurine portrait
curl https://huggingface.co/spaces/black-forest-labs/FLUX.2-dev/agents.md
# single image -> textured 3D mesh
curl https://huggingface.co/spaces/microsoft/TRELLIS.2/agents.md
Paste either link into your coding agent (Claude Code, etc.), set your HF_TOKEN, and ask it to build something. Or just open the studio, sign in, and turn your own face into a collectible. The full app, including the small server that hits those two agents.md endpoints, lives in the Space repo.
The building blocks are sitting on the Hub. The agents already know how to glue. You never had to open Blender.