The Office Meets Silicon Valley

Community Article
Published June 15, 2026

Brad Did Something

Corporate bureaucracy is already a chaotic simulation — I just replaced middle management with an LLM to see if the company could survive the quarter.

Welcome to Brad Did Something, a 2D top-down office comedy game built for the An Adventure in Thousand Token Wood track of the Hugging Face Build Small Hackathon. You play the Head of Sales and Partnerships at Veloura Technologies. Your objective is deceptively simple: manage five completely unpredictable underlings across 15 workplace events, survive one fiscal quarter, and hit a $1,000,000 revenue goal.

It's The Office meets Silicon Valley, driven entirely by generative AI.

![Brad Did Something — Gameplay Trailer] ▶ Watch the full gameplay trailer — surviving a quarter at Veloura Technologies.

Actual gameplay: walking the office, a crisis fires, the revenue meter moves A real, playable loop — walk up to an underling, the crisis hits, you argue back, the numbers move.

Smashing the Static Dialogue Tree

In traditional top-down RPGs, player agency is an illusion gated by hardcoded dialogue trees. You pick option A, B, or C and get a pre-written response down a predictable track. I wanted an experience where the AI is entirely load-bearing — changing the mechanical state of the world in real time.

In Brad Did Something, every piece of NPC dialogue, every bizarre workplace reaction, and every fiscal consequence is generated live by the model.

The load-bearing mechanic: the LLM doesn't just write decorative flavor text. Through strict JSON-schema validation, the model's output directly drives the game's economy. Talk an underling into salvaging a problematic enterprise lead and your revenue jumps; offend them and your pipeline collapses. You literally have to argue your way to $1,000,000 — and because the server owns all the truth, the model never sees (and can't leak) the hidden morale and relationship scores it's secretly moving.

2D Walkabout Meets AI Comics

To anchor the chaos I built a classic 2D top-down walking simulator. But to keep the narrative punchy and reward players for surviving meltdowns, I added a second layer of visual comedy: single-panel, wordless crisis comics generated on the fly.

When an event triggers, the text model drafts a short scene description; a separate image model renders a comic panel that drops directly over the office floor — giving a literal face to the corporate madness right before the dialogue opens and you have to clean up the mess.

Comic reveal The text model writes the situation; the image model draws it. Same beat, two models.


Cramming the Chaos into < 32B (The AI Architecture)

I built the engine around the llama.cpp runtime (for the Llama Champion badge). Building a generative comedy game on a small model is a brutal trade-off: comedy needs fast timing (low latency), but game logic needs strict instruction-following (usually more parameters). Here's how I found the sweet spot.

        ┌─────────────────────────────────────────────────────────────┐
        │  PLAYER types a reply  ("Tell Brad to just close the deal")  │
        └───────────────────────────────┬─────────────────────────────┘
                                         ▼
        ┌─────────────────────────────────────────────────────────────┐
        │  PYTHON BACKEND  (gr.Server / FastAPI)  — the "guard"        │
        │  builds the prompt + JSON schema, validates everything that  │
        │  comes back, owns all hidden state, falls back if a call dies │
        └───────────────┬─────────────────────────────┬───────────────┘
                        ▼                               ▼
        ┌───────────────────────────┐   ┌─────────────────────────────┐
        │  BRAIN — Modal L4 GPU      │   │  ARTIST — Modal A10G GPU     │
        │  llama.cpp + Qwen3.5-9B    │   │  FLUX.2 [klein] 4B (4-step)  │
        │  → grammar-locked JSON     │   │  → one wordless comic panel  │
        └───────────────┬───────────┘   └─────────────┬───────────────┘
                        └───────────────┬─────────────┘
                                        ▼
        ┌─────────────────────────────────────────────────────────────┐
        │  Custom HTML5 Canvas + DOM UI  (no default Gradio widgets)    │
        │  validated JSON updates the economy; the panel drops on top   │
        └─────────────────────────────────────────────────────────────┘

The tiny-model problem

I wanted to see how small I could realistically go. The game needs the AI to simultaneously roleplay an unhinged corporate persona and return a perfectly formatted JSON object that updates the engine — at the same time.

I put NVIDIA's Nemotron-3-Nano-4B (GGUF) and other sub-4B models through their paces on the exact same llama.cpp + JSON-schema stack. They're genuinely impressive for their size, but they buckled under the dual cognitive load: even with grammar enforcement they would truncate the JSON mid-object, lose the plot inside the schema, or simply write thinner, less funny dialogue. In a fresh A/B this week, the 4B returned a malformed payload on roughly one call in five — and wasn't even faster, since its reasoning-trained verbosity ate the per-token speed advantage.

The harsh conclusion: to reliably run strict, schema-enforced roleplay, I needed something bigger than 4B. (So yes — I consciously gave up the Tiny Titan badge, in the same spirit as the project's other deliberate sacrifices.)

The 9B sweet spot and the Modal GPU

I stepped up to bartowski/Qwen_Qwen3.5-9B-GGUF and the failures stopped.

At 9 billion parameters, Qwen3.5 threads the needle: smart enough to understand the financial stakes and emit flawless JSON state updates, creative enough to write genuinely funny, unpredictable NPCs. One trick that matters — Qwen3.5 is a "thinking" model, so the inference path injects an empty <think></think> block and then hands control to the JSON grammar, which forces valid output from the very first token (no room for a reasoning ramble or a "Sure, here's the outcome:" preamble).

To keep the comic timing from being ruined by slow generation, I deployed the llama.cpp engine on a Modal L4 GPU. Modal keeps the function warm for five minutes between calls, so warm responses land in a few seconds — simple beats in ~3–4s, the big crisis generations under ~10s. You type, you hit enter, the NPC snaps back fast enough that it feels like a high-stress argument, not waiting on an email.

Modal logs showing warm llama.cpp generations Live llama.cpp inference on Modal — 9B reasoning at a few seconds warm, fast enough to keep the gameplay fluid.


Keeping the Comics On-Model (Without Burning Tokens)

The comic panels have to match one specific corporate-satire look every single time, or the whole bit falls apart. The naive approach is to make the text model describe the art style in every prompt — but that wastes its budget and the style still drifts.

Two design choices fixed it, and neither one is a fine-tune:

1. The style lives in the backend, not the model. Qwen only ever writes the situationimage_prompt is pure scene description ("Brad standing on a desk hurling staplers, coworkers ducking"). The Python backend silently prepends a fixed COMIC_STYLE art-direction string (flat ink outlines, ben-day halftone, warm daylight office, chibi workers…) before sending it to FLUX. The model spends 100% of its budget on the comedy; the look is guaranteed by code. This actually mattered: early on, the style string was inside the model's field and ate ~230 of its ~320 characters, truncating the scene down to a single garbled panel. Moving the prepend server-side gave the whole budget back to the joke.

2. The image is wordless; the UI draws the words. Image models render garbled gibberish text ("I clicket send"), so I forbid all text, letters, and speech bubbles in the panel. The model instead writes a separate, crisp comic_caption, which the game renders as real UI text in a banner above the picture. Best of both worlds: a clean illustration plus perfectly legible narration.

Under the hood it's FLUX.2 [klein] 4B (the fast 4-step distilled model) on a second Modal GPU (A10G), rendering a panel in ~5 seconds warm. And it's purely decorative: if the image model is cold, slow, or fails, the game shows no overlay and the dialogue just opens — the outcome is never blocked waiting on art.


Lessons from the Trail (What Actually Broke)

Building an LLM-driven game sounds magical until you try to run it. Models are stubbornly helpful, JSON is fragile, and latency kills jokes. The biggest hurdles:

Latency is gameplay (and comedy)

A punchline that takes 15 seconds to render isn't a joke; it's a loading screen. That's exactly why the shipped game wires straight to a warm Modal L4 GPU rather than running inference locally on CPU — CPU turns are minute-long, which would murder the timing. Keeping the function warm drops responses to a few seconds and the conversation feels like a snappy boardroom argument. If you're building a real-time AI game, latency isn't an optimization — it's the core mechanic.

Curing the "helpful AI" disease

Tell a 9B model to "act like an angry sales guy" and it'll usually hand you a sterilized, HR-approved version of anger wrapped in caveats. To make the underlings genuinely unhinged, the system prompts had to be aggressive — commit fully to the persona, stay in first person, never break character, never explain yourself like an assistant. And because the model still slips sometimes, the server-side validator is the enforcer: it rejects any output that narrates itself in the third person, leaks game mechanics, or drifts out of voice, then asks once more before falling back to a safe canned line. You aren't just writing a character bio — you're actively fighting the model's safety training to turn it into a fun video-game NPC, and then bringing a net in case it wins.

The JSON is only as strict as the grammar

Schema-enforced generation guarantees shape, not length — the grammar doesn't cap string fields, so a chatty model can overrun a budget and get cut off mid-string, producing unparseable JSON. The fixes were unglamorous and essential: a generous token ceiling, a one-shot corrective retry before any fallback, and moving long decorative fields (like the comic prompt) out of the model's critical path. Reliability in an AI game is mostly plumbing.


Play It Now

Brad Did Something is live and fully playable right now. See if you can survive the quarter or if your underlings bankrupt the company by Tuesday.

🎮 Play the game: https://huggingface.co/spaces/build-small-hackathon/brad-did-something 🍿 Watch the trailer: https://youtu.be/BJSE5WDZvPs 💻 Source code: https://huggingface.co/spaces/build-small-hackathon/brad-did-something/tree/main

Community

Sign up or log in to comment