No Photoshop, No Blender: Multimedia by Agent

Published June 19, 2026

An agent built a 3D figurine studio, and now anyone can mint a collectible from one photo. No creative software was opened at any point.

I built an interactive studio that turns any photo into a collectible 3D figurine. I never opened Photoshop. I never opened Blender. Not once during the build, and not once when a figurine is actually generated. Every image is produced by one Hugging Face Space and every 3D model by another, both driven by an agent reading their agents.md.

Here it is, live. Sign in and bring your own face:

👉 mishig/figurine-factory

This is the third post in a small series about treating multimedia models as callable building blocks. The first two built galleries. This one is different: the studio is a real app, and the end user runs the exact same pipeline the agent did, with their own prompt and their own photo.

The old multimedia stack was tools. The new one is endpoints.

A few years ago, making "a stylized 3D figurine of a person" meant a pipeline of heavy creative software. Retouch the photo in Photoshop. Sculpt or retopologize in Blender or ZBrush. Bake textures. Light a scene. Each step was its own application, its own file format, its own expertise.

The interesting shift is not that AI models can now do each of those steps. It is that you no longer integrate them as software at all. You call them. A state-of-the-art image model and a state-of-the-art image-to-3D model are each just an HTTP endpoint with a documented contract. The agent does not import a library or manage weights or provision a GPU. It reads how to call the thing, then calls it.

So the creative stack collapses from "install and learn N applications" down to "describe what you want, and let an agent glue together a couple of model endpoints."

`agents.md` is the contract

Every Gradio Space on the Hub exposes a plain text agents.md that tells an agent exactly how to drive it:

curl https://huggingface.co/spaces/microsoft/TRELLIS.2/agents.md

returns the whole contract in one shot: the API schema, the call and poll endpoints, how to upload a file, and the auth hint.

API schema:    GET  .../gradio_api/info
Call endpoint: POST .../gradio_api/call/{endpoint} {"param_name": value, ...}
Poll result:   GET  .../gradio_api/call/{endpoint}/{event_id}
File inputs:   POST .../gradio_api/upload -F "files=@file.ext"
Auth:          Bearer $HF_TOKEN

No client library, no hardcoded integration. You can find this on any Gradio Space through its Agents button:

The Agents button on a Hugging Face Space, showing the agents.md instructions

Point a coding agent at that link, give it an HF_TOKEN, and it can run the Space end to end. The payoff is chaining: the output of one Space becomes the input to the next.
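The call and poll steps of the contract above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the studio's actual server code. It assumes the call endpoint answers with a JSON object containing an `event_id` and that the poll endpoint streams server-sent events ending in an `event: complete` line, which is how Gradio's HTTP API has generally behaved. `base_url`, `endpoint`, and `params` are placeholders you would fill in from the Space's own agents.md.

```python
import json
import urllib.request


def parse_sse(text):
    """Split a server-sent-event body into (event, data) pairs."""
    events, event = [], None
    for line in text.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            events.append((event, line[len("data:"):].strip()))
    return events


def final_result(text):
    """Return the decoded payload of the 'complete' event, or None."""
    for event, data in parse_sse(text):
        if event == "complete":
            return json.loads(data)
    return None


def call_space(base_url, endpoint, params, token):
    """POST the call, then GET the result stream. Needs network access."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    # Step 1: submit the job. Assumption: the reply carries an "event_id".
    req = urllib.request.Request(
        f"{base_url}/gradio_api/call/{endpoint}",
        data=json.dumps(params).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        event_id = json.load(resp)["event_id"]
    # Step 2: read the event stream until it closes, then pick the result.
    poll = urllib.request.Request(
        f"{base_url}/gradio_api/call/{endpoint}/{event_id}", headers=headers
    )
    with urllib.request.urlopen(poll) as resp:
        return final_result(resp.read().decode("utf-8"))
```

Keeping the stream parsing separate from the network call means the parsing can be checked offline against a canned response before any GPU quota is spent.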

The pipeline: one photo, two Spaces

The whole figurine pipeline is two endpoints in a row.

1. Photo to figurine portrait. black-forest-labs/FLUX.2-dev takes a reference photo plus a prompt and returns an identity-preserving "collectible vinyl figurine" portrait. This is the step that used to be Photoshop. It is now one call with a reference image attached.
2. Portrait to 3D. microsoft/TRELLIS.2 takes that single image and returns a real textured 3D mesh (a .glb). This is the step that used to be Blender. It is now one call.
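The two steps above amount to a plain function composition. Here is a rough sketch of that chaining, with the actual HTTP work hidden behind a `call` function that the caller supplies. The parameter names `image` and `prompt` are hypothetical. The real names come from each Space's API schema.

```python
FLUX = "black-forest-labs/FLUX.2-dev"
TRELLIS = "microsoft/TRELLIS.2"


def make_figurine(photo, prompt, call):
    """Chain the two Spaces: photo + prompt -> portrait -> 3D mesh.

    `call(space_id, params)` is any function that runs one Space and
    returns its output (for example a file URL).
    """
    # Step 1: stylize the reference photo (parameter names are assumed).
    portrait = call(FLUX, {"image": photo, "prompt": prompt})
    # Step 2: the portrait from step 1 is the only input to step 2.
    mesh = call(TRELLIS, {"image": portrait})
    return portrait, mesh
```

Passing `call` in as an argument keeps the chaining logic independent of how each Space is actually reached, so the same function works with a stub in tests and a real HTTP client in the app.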

Here is a real run. A reference photo on the left, the figurine the pipeline returns on the right. No manual editing in between.

[Images: input reference photo (left), generated figurine portrait (right)]

Because the two steps run in sequence, the studio shows the intermediate result while you wait: your photo, then the figurine portrait the moment FLUX returns it, then the spinning 3D model once TRELLIS finishes.
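One way to surface each intermediate result as soon as it exists is a generator that yields after every stage. The sketch below is an illustration of that idea, not the studio's code. `stylize` and `sculpt` are hypothetical stand-ins for the FLUX and TRELLIS calls.

```python
def figurine_stages(photo, prompt, stylize, sculpt):
    """Yield (stage, value) pairs as each step of the pipeline finishes."""
    # The uploaded photo is available immediately.
    yield "photo", photo
    # The portrait is yielded before the slower 3D step starts.
    portrait = stylize(photo, prompt)
    yield "portrait", portrait
    # The mesh arrives last.
    yield "mesh", sculpt(portrait)
```

A UI loop can then update the page on every yielded pair, for example `for stage, value in figurine_stages(...): show(stage, value)`, where `show` is whatever rendering function the app uses.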

The studio showing the figurine portrait while the 3D model is being sculpted

To prove the style holds across very different subjects, the studio ships with a cabinet of famous artworks, each one a single prompt through the same two Spaces:

The Girl with a Pearl Earring, the Birth of Venus, American Gothic, and the Son of Man. Same two Spaces, only the prompt changed.

The twist: the user runs the pipeline too

In the first two posts, the agent was the only one calling the Spaces, and the output was a finished gallery. This time the app hands the pipeline to the visitor.

When you sign in with Hugging Face and upload a photo, the studio forwards your image to FLUX.2 and TRELLIS using your token, so the GPU work runs on your own quota. The app itself is mostly glue: a small server that reads the same agents.md contracts, plus a viewer to spin the result. There is no creative software on the backend, because there is no creative software anywhere in the loop.
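Forwarding the visitor's token rather than the app's own can be as small as one helper. This is a sketch under assumptions: `session` is a plain dict and `hf_token` is a hypothetical key under which the server stored the token after sign-in. The real app may store it differently.

```python
def headers_for_visitor(session):
    """Build request headers from the signed-in visitor's own token."""
    token = session.get("hf_token")  # hypothetical session key
    if not token:
        # Without a visitor token there is no quota to run the GPU work on.
        raise PermissionError("sign in with Hugging Face first")
    return {"Authorization": f"Bearer {token}"}
```

Every request to FLUX.2 and TRELLIS would then carry these headers, so usage is attributed to the visitor rather than to the app.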

That is the part I find most telling. The "studio" is not a 3D tool that happens to use AI. It is a thin wrapper around two model endpoints, and the wrapper is the only thing that needed to be written.

Why this is the shape of things

The creative stack is collapsing into endpoints. "Stylize a portrait" and "reconstruct a 3D mesh" used to be separate applications with separate skill trees. They are now two HTTP calls an agent can chain in minutes.
Open weights make it composable. FLUX.2 and TRELLIS.2 come from different labs, and they chain with zero integration code because both are reachable through the same documented interface.
Agents prefer what is documented and reachable. agents.md makes a Space trivial to drive, so an agent reaches for it instead of a model it would have to set up by hand.
The marginal cost of a multimedia app falls toward the cost of describing it. A new figurine is a prompt. A whole new pipeline is a conversation.

The future of a lot of multimedia software looks less like learning Photoshop and Blender, and more like connecting your agent to a prompt, and connecting that to the right image and 3D generation Spaces on the Hub.

Try it

# photo + prompt -> figurine portrait
curl https://huggingface.co/spaces/black-forest-labs/FLUX.2-dev/agents.md
# single image -> textured 3D mesh
curl https://huggingface.co/spaces/microsoft/TRELLIS.2/agents.md

Paste either link into your coding agent (Claude Code, etc.), set your HF_TOKEN, and ask it to build something. Or just open the studio, sign in, and turn your own face into a collectible. The full app, including the small server that hits those two agents.md endpoints, lives in the Space repo.
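If you would rather fetch the contracts from a script than from curl, a rough Python equivalent of the two commands above looks like this. The URL pattern is taken from those commands. The optional `token` argument is an assumption for Spaces that might require authentication.

```python
import urllib.request

HUB = "https://huggingface.co"


def agents_md_url(space_id):
    """Build the agents.md URL for a Space id such as 'owner/name'."""
    return f"{HUB}/spaces/{space_id}/agents.md"


def fetch_agents_md(space_id, token=None):
    """Download the plain-text contract. Needs network access."""
    req = urllib.request.Request(agents_md_url(space_id))
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch_agents_md("microsoft/TRELLIS.2")` should return the same text as the second curl command.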

The building blocks are sitting on the Hub. The agents already know how to glue. You never had to open Blender.
