Build Small, Deploy Big Panic: How I Built DOD - Deploy or Draw for the Build Small Hackathon
This is less of a polished "perfect case study" and more of a technical field diary: what worked, what broke in weird ways, and the decisions I made while trying to turn a slightly absurd idea into a playable game.
Project links:
- Demo: DOD UNO on Hugging Face Spaces
- Official repository: DEVAIEXP/doduno
- Demo video: Watch on YouTube
- Agent traces dataset: build-small-hackathon/dod-agent-traces
1. The Idea That Survived The Prototype
DOD - Deploy or Draw started with a simple question, the kind that shows up out of nowhere: what if a production incident became a card game?
The UNO structure helped because most people understand the basic loop quickly: match a color, match a card type, draw if you cannot play, shout when you have one card left. The software-engineering theme added the humor layer: refactoring code, copying StackOverflow, dropping the production database, deploying on Friday evening, getting pulled into surprise meetings, and other tiny disasters that feel familiar to anyone who builds software.
But the game only started feeling interesting when AI stopped being decorative and became load-bearing.
Nemotron is not a "generate text" button. It is a required player. It receives the current table state, evaluates its own hand, decides whether to play or draw, and when it plays a wild card, it also chooses the next stack.
The IT Director is not just a random message either. He reacts to the card that was played, considers the current crisis, and generates a very short quote in English and Portuguese. With VoxCPM2, those quotes become audio, turning the match log into a kind of corporate arcade table.
The main lesson was pretty direct: in a small AI experience, the model needs to participate in the tension. If AI only decorates the interface, it feels replaceable. When it plays, reacts, and changes the rhythm of the match, it becomes part of the toy.
2. Small Models Work Better When The Problem Is Well Bounded
Nemotron Nano 4B worked best when the task was reduced to a small, structured decision.
The backend does not ask the model to "understand UNO" from scratch. Before calling the LLM, Python computes which cards are playable using the authoritative game rules. Each card in the bot's hand receives a flag:
{
  "index": 2,
  "stack": "blue",
  "category": "FIX",
  "playable": true,
  "res": 15,
  "panic": -5
}
The bot prompt makes it clear that it can only choose cards with "playable": true. The expected response is also small:
{
  "action": "PLAY",
  "card_index": 2,
  "chosen_color": "none"
}
For wild cards, the contract includes the selected color:
{
  "action": "PLAY",
  "card_index": 0,
  "chosen_color": "green"
}
That was a turning point. The model stopped trying to solve a broad logical search and started choosing inside a controlled space. The CPU handles the boring deterministic part. The LLM gets the strategic and expressive part.
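A minimal sketch of that pre-computation step, under a simplified matching rule that is my assumption here: a card is legal if it is wild or shares the stack or category of the active card. The function name `annotate_hand`, the `"wild"` stack value, and the `BUG` and `MEETING` categories are made up for illustration; the real rules live in the backend.

```python
from typing import Any


def annotate_hand(hand: list[dict[str, Any]], active: dict[str, Any]) -> list[dict[str, Any]]:
    """Return a copy of the hand with an index and a `playable` flag per card.

    Assumed rule (illustrative only): a card is playable if it is wild,
    or if it shares the stack or the category of the active card.
    """
    annotated = []
    for index, card in enumerate(hand):
        playable = (
            card["stack"] == "wild"
            or card["stack"] == active["stack"]
            or card["category"] == active["category"]
        )
        annotated.append({**card, "index": index, "playable": playable})
    return annotated


# Hypothetical hand and active card, only to show the shape of the data.
hand = [
    {"stack": "red", "category": "BUG", "res": 0, "panic": 10},
    {"stack": "blue", "category": "FIX", "res": 15, "panic": -5},
    {"stack": "wild", "category": "WILD", "res": 0, "panic": 0},
]
active_card = {"stack": "blue", "category": "MEETING"}
flags = [card["playable"] for card in annotate_hand(hand, active_card)]
```

The point of the design is that the model never has to derive legality itself; it only reads the flag.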
This became one of the most important lessons in the project: small models do not need to feel weak if the system around them is designed well. They perform better inside a carefully built harness, with validation, compact context, and a narrow task.
3. The Backend Must Stay Authoritative
Even with a prompt and schema, the model can choose poorly. It can try to draw even when it has a playable card. It can choose an invalid card. It can time out. It can return an empty response. It can hit a quota limit.
So the game does not blindly trust the LLM.
The current flow is:
- The backend computes playable cards.
- The LLM tries to decide.
- The response is extracted and validated as JSON.
- If the action is invalid, the backend ignores it.
- If Nemotron chooses to draw while a playable card exists, the backend rejects that decision to preserve the match rules.
- If the remote call fails because of timeout, quota, or downtime, a local fallback takes over only for that move so the match does not die.
The main path is still Nemotron. It is what gives the opponent personality and behavior. The local fallback is more like a seatbelt: it does not replace the model's role in the experience, it just prevents an infrastructure failure from becoming a fatal match error.
That difference matters. The local fallback can do the minimum required: find a legal card, play the first safe option, or draw when no play is possible. It does not interpret the crisis, weigh panic versus resolution, choose a card with strategic intent, or make the opponent feel like it is reading the table. Nemotron, on the other hand, receives the match state, the active card, the effects of available choices, and can choose between playable alternatives while considering risk, crisis progress, and opportunity. For wild cards, it can also return chosen_color, making a choice that the fallback only approximates with the dominant color in hand.
In other words: the fallback keeps the rules standing; Nemotron creates the opponent.
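The validation flow above can be sketched roughly like this. It is a simplified illustration, not the project's code: the `"DRAW"` action value, the `card_index` of `-1` for a draw, and both function names are assumptions, and the dominant-color choice for wild cards is left out.

```python
import json
from typing import Any, Optional


def fallback_move(hand: list[dict[str, Any]]) -> dict[str, Any]:
    """Rule-only fallback: play the first legal card, otherwise draw."""
    for card in hand:
        if card["playable"]:
            return {"action": "PLAY", "card_index": card["index"], "chosen_color": "none"}
    return {"action": "DRAW", "card_index": -1, "chosen_color": "none"}


def resolve_bot_move(raw: Optional[str], hand: list[dict[str, Any]]) -> tuple[dict[str, Any], bool]:
    """Return (move, used_fallback). The backend keeps the final say."""
    playable = {card["index"] for card in hand if card["playable"]}
    try:
        move = json.loads(raw or "")
    except json.JSONDecodeError:
        move = None  # empty, truncated, or non-JSON response
    if isinstance(move, dict):
        action, index = move.get("action"), move.get("card_index")
        if action == "PLAY" and isinstance(index, int) and index in playable:
            return move, False
        # Drawing is only accepted when no legal card exists.
        if action == "DRAW" and not playable:
            return move, False
    return fallback_move(hand), True
```

A timeout or quota error would reach the same last line by passing `None` as the raw response, so the match continues either way.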
The same applies to the IT Director. When the LLM is available, quotes are generated from the played card, the effect of that move, the current crisis, and recent lines. This creates variety and gives the Director a more human presence, as if he is reacting to the table in real time. When generation fails, the game uses fixed crisis quotes as fallback. They preserve the scene and avoid silence, but they are naturally more repetitive and less specific. The fallback keeps the scene functioning; the LLM gives it dynamism, context, and personality.
That changes the experience completely. Instead of "the game broke because the endpoint failed", the match continues. The user may not even notice a fallback happened. In AI games, the model can be creative and central, but the system still needs to be robust around it.
4. JSON Schema Helps, But It Does Not Replace Validation
Both the bot and the Director are called with JSON schemas. This greatly reduces conversational responses, accidental markdown, and reasoning leakage.
But in practice, the app still needed safeguards:
- JSON extraction when the model returns a markdown block;
- removal of wrappers like </think>;
- validation of required fields;
- fallback when the response does not start with { and end with };
- additional validation of playable cards;
- lexical validation for Director quotes.
An important lesson: a schema is a fence, not a perfect prison. For an interactive demo, especially with small models, a second validation layer is absolutely worth it.
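Here is a small sketch of what that second layer can look like for the bot response. The helper name `extract_json` and the exact order of the cleanup steps are my assumptions; the required fields are the ones from the bot contract shown earlier.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable
REQUIRED_FIELDS = ("action", "card_index", "chosen_color")


def extract_json(text: str, required: tuple[str, ...] = REQUIRED_FIELDS) -> Optional[dict[str, Any]]:
    """Best-effort extraction of one JSON object from a noisy model response."""
    # Drop a reasoning wrapper: keep only what follows the last closing tag.
    text = text.split("</think>")[-1]
    # Unwrap a fenced markdown block if the model added one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1)
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):
        return None
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(field not in data for field in required):
        return None
    return data
```

Returning `None` instead of raising keeps the caller simple: anything that is not a clean object goes straight to the fallback path.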
5. The Director Needs Context, But Too Much Context Becomes An Obsession
At first, the Director quotes were funny but loosely connected to the match. Later, the payload started including:
- played card;
- card type;
- card effect;
- card name and feedback in English and Portuguese;
- resolution and panic before/after;
- current crisis;
- a short history of recent Director quotes.
That improved the output a lot. The Director started referring to "corrupted database", "Hub traffic", "leaked AWS keys", or "AI hallucinating" in a much more contextual way.
But a new problem appeared: sometimes the model became too attached to the crisis and treated every card as if it had solved the whole incident. To mitigate that, the prompt gained rules like:
- do not say the entire crisis was solved if resolution is still low;
- do not copy recent quotes;
- do not ignore the card just to talk about the crisis;
- use the crisis when it helps, but keep the played card at the center of the reaction.
This was a subtle lesson, and honestly a little counterintuitive: too much context can become an obsession. For small models, good context is short, labeled, and paired with clear limits.
6. Technical Portuguese Needs A Glossary, Not Just "Write Natural Portuguese"
The first Portuguese quotes showed a common limitation of small bilingual setups. The model tried to translate technical terms literally:
- "frontend" became "front" in the wrong sense or even "frente";
- "revert" became an unnatural invented word;
- "database" could become awkward;
- "fix" could be translated as an adjective instead of a bug fix.
The solution was to treat technical Brazilian Portuguese as part of the prompt, not as a post-translation step. It sounds like a small detail, but it really was not. The Director prompt gained:
- an allowed-terms glossary;
- forbidden literal translations;
- few-shot examples;
- very short quotes;
- emotional separation between good and bad cards;
- validation of generated phrases;
- fallback to hand-written crisis quotes.
Even then, the system does not try to make the model perfect. It tries to block the ugliest errors and keep the experience fun. In a game demo, an unexpected line can be funny; a broken translation that destroys the context needs to be filtered.
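As an illustration of the lexical validation idea, here is a tiny filter. The forbidden list contains only "frente", the bad translation mentioned above; the word limit, the function names, and the sample quotes are assumptions I made up for the sketch, not the project's real glossary.

```python
import re

# Hypothetical values for illustration only.
FORBIDDEN_PT = {"frente"}  # literal translation of "frontend" to block
MAX_WORDS = 12             # assumed limit for "very short quotes"


def is_acceptable_pt_quote(quote: str) -> bool:
    """Reject empty, overly long, or badly translated Portuguese quotes."""
    words = re.findall(r"\w+", quote.lower())
    if not words or len(words) > MAX_WORDS:
        return False
    return not any(word in FORBIDDEN_PT for word in words)


def choose_quote(generated: str, fallback: str) -> str:
    """Use the generated line only when it passes the lexical check."""
    return generated if is_acceptable_pt_quote(generated) else fallback
```

For example, `choose_quote("Quem quebrou a frente de novo?", "Produção caiu. De novo.")` would return the hand-written fallback, because the generated line uses the blocked word.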
7. A Single LLM Queue Prevents Collisions
The game has two main kinds of LLM tasks:
- bot decisions;
- Director quotes.
If these tasks ran in parallel without control, they would compete for endpoint capacity, CPU/GPU, and quota, and their results could arrive out of order. The current solution is a single FIFO queue (llm_queue) processed by one sequential worker.
That sounds simple, almost like "just put a queue there", but it solved multiple classes of problems:
- avoids multiple heavy calls at the same time;
- preserves the logical order of actions;
- reduces the risk of the bot and Director fighting over resources;
- keeps the main app responsive;
- makes fallback and error logging easier.
In small systems, a well-placed queue can be more valuable than a complex architecture.
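The pattern is small enough to sketch with the standard library. Only the name `llm_queue` comes from the game; the task shape, `handle_task`, and the `None` sentinel are assumptions for the example.

```python
import queue
import threading

llm_queue: queue.Queue = queue.Queue()  # FIFO by default
processed: list[str] = []


def handle_task(task: dict) -> None:
    """Placeholder for the real work (bot decision or Director quote)."""
    processed.append(f'{task["kind"]}:{task["id"]}')


def worker() -> None:
    """Single sequential consumer: one heavy call at a time, in arrival order."""
    while True:
        task = llm_queue.get()
        try:
            if task is None:  # sentinel used to stop the worker
                return
            handle_task(task)
        except Exception as exc:
            # A failing task must not kill the worker; fallback logic goes here.
            print(f"LLM task failed: {exc}")
        finally:
            llm_queue.task_done()


threading.Thread(target=worker, daemon=True).start()
for task_id, kind in enumerate(["director", "bot", "director"]):
    llm_queue.put({"kind": kind, "id": task_id})
llm_queue.join()  # wait until every queued task has been handled
```

Because there is only one consumer, ordering and error handling live in one place, which is most of the benefit.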
8. A Background Worker Loses Request Context
Once work moved to the background, an important detail appeared: the worker does not automatically have the original gr.Request.
This became critical with Hugging Face ZeroGPU. When one Space calls another Space through gradio_client, it may need to forward the original user's x-ip-token header. Without it, the inference Space can treat the call as missing quota context.
The issue became clearer when the Spaces started running inside an organization. While I was testing endpoints associated with my own profile, the flow seemed to inherit my quota better, or at least did not expose the problem as clearly. Once the topology moved into an organization, the Space-to-Space call no longer carried that user context implicitly, and a quota error appeared even though the public endpoints were reachable.
The fix was to carry the context explicitly:
- capture x-ip-token from the request;
- associate the token with the player;
- put the token into the LLM task;
- create the Client with headers={"x-ip-token": token}.
Lesson: if an asynchronous task depends on the user, the user context needs to travel with the task. You cannot assume the framework will magically guess that later.
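A sketch of how the token can travel with the task. The `LLMTask` dataclass and both helper names are hypothetical, and the Space name in the comment is a placeholder; the only parts taken from the flow above are the header name and passing `headers` to the `gradio_client` `Client`.

```python
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass
class LLMTask:
    """Unit of work for the background worker, carrying its own user context."""
    kind: str
    payload: dict = field(default_factory=dict)
    ip_token: Optional[str] = None


def token_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Read x-ip-token from request headers, ignoring header-name casing."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return lowered.get("x-ip-token")


def client_kwargs(task: LLMTask) -> dict:
    """Keyword arguments for the client; headers are added only when a token exists."""
    if task.ip_token:
        return {"headers": {"x-ip-token": task.ip_token}}
    return {}


# Inside the worker (not executed here, requires gradio_client):
#   client = Client("owner/space-name", **client_kwargs(task))
```

In the event handler, the headers would come from the `gr.Request` object, for example `token_from_headers(dict(request.headers))`, before the task is queued.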
9. TTS Is Where The Experience Breaks Most Easily
Audio was one of the hardest parts of the project, and I definitely underestimated it at the start.
Problems that appeared:
- slow local TTS accumulating requests;
- audio arriving after the match already ended;
- old lines playing in the next match;
- local API overload;
- remote endpoint cold starts;
- differences between REST and Gradio APIs;
- bot audio colliding with player audio;
- replay out of context after victory or defeat.
The final solution combines several protections:
- a separate TTS queue (tts_audio_queue);
- sequential synthesis to avoid overloading the service;
- audio_generation_id to discard work from old matches;
- per-player queues in pending_audios;
- audio delivery through state polling;
- discard late audio after the match ends;
- DOD_DISABLE_TTS=True for fast development;
- keep Director text even when audio is disabled.
Modal became an important piece because TTS is heavy. Running VoxCPM2 locally together with the LLM can saturate VRAM, delay voice delivery, and even slow bot responses. With a TTS server on Modal, the game can keep the voice experience without requiring the local machine to load everything at once. In practice, this created three useful modes: local TTS for users who want everything on their own GPU, remote Modal TTS for a more stable demo, and DOD_DISABLE_TTS=True for fast development when Director text is enough.
Even then, Modal and Spaces can still cold start. That is why the game gained a warmup stage before starting a match. When the lobby reaches the minimum condition to begin, the backend sends a light call to TTS and another one to the LLM. Those calls use the same endpoint system from inference_map.json when the mapper is active, or local endpoints when DOD_USE_LOCAL_API=True. The match is released only once the services respond; if something fails, the lobby shows feedback and retries at the next opportunity. This prevents the first user move from paying the entire cold-start cost alone.
The lesson was direct: in games, delayed audio can be worse than missing audio. If a line arrives after its context is dead, it breaks the experience.
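The stale-audio discard from this section can be sketched as a generation counter. The names `audio_generation_id` and `pending_audios` come from the game; the `AudioMailbox` class and its methods are a simplified structure I am assuming for the example.

```python
from collections import defaultdict, deque


class AudioMailbox:
    """Per-player audio queues that reject work from finished matches."""

    def __init__(self) -> None:
        self.audio_generation_id = 0
        self.pending_audios: dict[str, deque] = defaultdict(deque)

    def end_match(self) -> None:
        """Invalidate in-flight synthesis and drop anything not yet delivered."""
        self.audio_generation_id += 1
        self.pending_audios.clear()

    def deliver(self, generation_id: int, player: str, audio: bytes) -> bool:
        """Queue audio only if it was requested during the current match."""
        if generation_id != self.audio_generation_id:
            return False  # late audio from an old match: discard
        self.pending_audios[player].append(audio)
        return True

    def poll(self, player: str) -> list[bytes]:
        """Called from state polling: hand over and clear the player's audio."""
        items = list(self.pending_audios[player])
        self.pending_audios[player].clear()
        return items
```

The TTS worker captures the generation id when the request is queued and passes it back on delivery, so cancellation is logical and no running synthesis has to be interrupted.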
10. Gradio Custom HTML Was The Right Path
It would have been possible to build the whole interface with standard Gradio components. But it would not have felt like a game.
The project gained identity when the board became a custom component:
- Board(gr.HTML);
- html_template;
- css_template;
- js_on_load;
- watch('value', ...);
- Python calls through server_functions;
- custom toast with NeonToast;
- audio, mute/unmute, and browser interactions handled directly in JavaScript.
This made it possible to build real cards, animations, spectator mode, color selection, a visual log, and match feedback without turning everything into a chain of Gradio buttons.
Lesson: Gradio can go far beyond forms. But to do that well, you need to respect the framework model and encapsulate the bridge between JavaScript and Python carefully.
11. Iframes, Viewports, And Fixed Layout
Running on Hugging Face Spaces has a particular constraint: the app lives inside an iframe. In earlier phases, viewport-dependent heights could cause unwanted layout growth.
The current board uses a stable 750px height in .game-layout, and the app handles the background canvas carefully to avoid resize loops.
That detail looks small, but it changes UI stability. In a game, a layout that jumps, grows, or forces unexpected scroll makes the user feel like the rules broke.
12. Timers Need Different Priorities
The app uses polling because it was the most stable way I found to synchronize multiple browsers in Gradio.
But not every piece of data needs the same frequency:
- player state: 1 second;
- spectator state: 2 seconds;
- backend tick: 1 second;
- leaderboard: 15 seconds;
- lobby/warmup: 1 second.
This avoids unnecessary traffic and reduces event conflicts. It was also important to add show_progress=False to Gradio events so constant polling did not turn the UI into a tree of spinners.
After testing with other people, one rhythm problem became obvious in a way I would not have noticed playing alone: Nemotron was too fast. In a table with more players, Director audio could arrive after the next turn had already moved on, creating a queue of voice lines instead of a table conversation. The fix was a small turn handoff delay with DOD_TURN_HANDOFF_DELAY_SECONDS, plus an extra bot multiplier through DOD_BOT_TURN_HANDOFF_MULTIPLIER. During that short interval, the next player does not see their hand as playable yet. It feels closer to a real UNO table: the previous player is still finishing their move, instead of the interface obviously saying "wait, the system is blocking you".
Lesson: in an interactive app, polling is not the problem. Polling without priorities is.
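The turn handoff delay can be sketched as a simple time gate. The two environment variable names are the real ones; the default values, the function names, and the assumption that the multiplier applies after a bot move are mine.

```python
import os


def handoff_delay(previous_was_bot: bool) -> float:
    """Seconds the next player waits before their hand becomes playable."""
    # Defaults below are placeholders, not the project's real defaults.
    base = float(os.getenv("DOD_TURN_HANDOFF_DELAY_SECONDS", "1.5"))
    multiplier = float(os.getenv("DOD_BOT_TURN_HANDOFF_MULTIPLIER", "2.0"))
    return base * multiplier if previous_was_bot else base


def hand_is_playable(now: float, turn_started_at: float, previous_was_bot: bool) -> bool:
    """True once the handoff interval has elapsed since the turn changed."""
    return now - turn_started_at >= handoff_delay(previous_was_bot)
```

The state poll can simply report the hand as not playable until this returns true, which is why the player sees a natural pause instead of a blocking message.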
13. The Lobby Was Harder Than The Game
The card logic is complex, but the lobby produced the most slippery bugs.
Cases that needed handling:
- Nemotron is always required;
- DOD_MAX_PLAYERS includes the bot;
- with DOD_MAX_PLAYERS=2, human + Nemotron starts without a countdown;
- with DOD_MAX_PLAYERS>=3, the lobby can wait for more humans;
- DOD_MIN_PLAYERS_TO_START defines the minimum number of seats required before the match can start;
- once that minimum is reached, DOD_LOBBY_START_COUNTDOWN_SECONDS starts a countdown so people do not wait forever for a full room that may never happen;
- the first player can trigger warmup;
- the second player cannot enter the match if warmup has already reserved the room;
- queued users must stay in the lobby;
- the "Leave Queue" button must never remain visible out of context;
- the queue must rotate correctly after the match ends;
- a promoted player must receive the player board, not the spectator board.
The biggest lesson here: multiplayer starts before the match. The lobby is a critical part of the product, not just a little waiting screen.
The minimum-player rule was important for the experience. If DOD_MAX_PLAYERS=4, for example, the game does not need to stay blocked forever waiting for four participants. Once the configured minimum is reached, the lobby shows a countdown. If more players join before it ends, they enter the match up to the limit. If nobody else appears, the match starts with the available players and later users go into the queue for the next round.
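A compact way to express that rule is a pure function over the lobby state. This is a sketch under assumptions: the function name and the three return values are invented, a full room is assumed to start immediately, and warmup gating is left out.

```python
from typing import Optional


def lobby_decision(
    seats: int,
    max_players: int,
    min_players: int,
    countdown_seconds: float,
    countdown_started_at: Optional[float],
    now: float,
) -> str:
    """Return "WAIT", "COUNTDOWN", or "START". `seats` includes Nemotron."""
    if seats >= max_players:
        return "START"  # room is full, nothing left to wait for
    if seats < min_players:
        return "WAIT"
    if countdown_started_at is None:
        return "COUNTDOWN"  # minimum just reached: caller records the start time
    elapsed = now - countdown_started_at
    return "START" if elapsed >= countdown_seconds else "COUNTDOWN"
```

With a maximum of 2, one human plus Nemotron fills the room and the first branch starts the match without a countdown, which matches the behavior described for DOD_MAX_PLAYERS=2.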
14. Authentication Is Not Match State
Many bugs appeared because identity, lobby, and match state looked like the same thing.
They are different states:
- being authenticated with Hugging Face;
- being in the lobby;
- being in the queue;
- playing;
- watching as a spectator;
- leaving the match;
- logging out.
The most important point was preserving hf_user_id as login identity, not as match state. Leaving a match does not mean leaving Hugging Face. Ending a match should not make the manual name field reappear if the user is still authenticated.
It was also important to separate HF_TOKEN_DATASET from HF_TOKEN. The app uses HF_TOKEN_DATASET for private datasets and removes HF_TOKEN from the environment so it does not break Gradio local/mock OAuth.
Lesson: OAuth is an identity layer. The match is another layer. Mixing them creates very strange bugs.
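One way to keep the layers apart is to model them as separate fields with separate transitions. Only `hf_user_id` is a name from the app; the `Seat` enum, the `Session` class, and its methods are a hypothetical structure for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Seat(Enum):
    NONE = "none"
    LOBBY = "lobby"
    QUEUE = "queue"
    PLAYING = "playing"
    SPECTATING = "spectating"


@dataclass
class Session:
    hf_user_id: Optional[str] = None  # identity layer (OAuth)
    manual_name: str = ""             # typed name for guests
    seat: Seat = Seat.NONE            # match layer

    def leave_match(self) -> None:
        """Leaving a match changes the seat, never the identity."""
        self.seat = Seat.LOBBY

    def logout(self) -> None:
        """Logging out clears identity and removes the user from any seat."""
        self.hf_user_id = None
        self.seat = Seat.NONE

    @property
    def show_manual_name_field(self) -> bool:
        """The name field depends on identity only, not on match state."""
        return self.hf_user_id is None
```

With this split, ending a match cannot bring the manual name field back for a logged-in user, because the field never looks at the seat.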
15. A Leaderboard Needs Strong Identity
At first, a leaderboard based on the typed name seemed sufficient. But it was unsafe:
- anyone could type someone else's name;
- wins could be assigned incorrectly;
- guests could pollute the official ranking;
- Nemotron wins in unauthenticated matches could count without context.
The current rule is safer:
- only Hugging Face authenticated users update the official leaderboard;
- Nemotron only records results when at least one authenticated human is in the match;
- guests can play, but they do not change the official ranking;
- profile pictures can appear on the leaderboard and opponent cards.
Lesson: a typed name is great for casual fun, but not for a public ranking.
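The leaderboard rule fits in a few lines. This is a sketch, assuming a `Player` record with an `is_bot` flag and a `should_record` helper, both of which are names I invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Player:
    name: str
    hf_user_id: Optional[str] = None  # set only for authenticated users
    is_bot: bool = False


def should_record(winner: Player, players: list[Player]) -> bool:
    """Decide whether a result may update the official leaderboard."""
    if winner.is_bot:
        # Nemotron wins count only if an authenticated human was at the table.
        return any(p.hf_user_id is not None and not p.is_bot for p in players)
    # Humans must be authenticated; guests play but never change the ranking.
    return winner.hf_user_id is not None
```

Keying the ranking on `hf_user_id` instead of the display name is what prevents someone from typing another player's name and collecting their wins.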
16. Local And Remote Need To Be First-Class Citizens
The project needed to work in several configurations:
- everything local;
- local LLM and TTS disabled;
- local LLM and remote TTS;
- remote endpoints through a dataset;
- local leaderboard;
- remote leaderboard through a Hugging Face Dataset;
- public Space using hosted endpoints;
- TTS on Modal;
- primary/fallback routing.
That led to inference_mapper.py, with independent endpoint chains for LLM and TTS:
- LLM_URL_PRIORITY=primary|fallback;
- TTS_URL_PRIORITY=primary|fallback;
- DOD_USE_LOCAL_API=True;
- DOD_USE_LOCAL_DATA=True;
- per-endpoint timeouts;
- per-endpoint cooldowns.
One important decision was putting inference_map.json in a private Hugging Face Dataset instead of depending only on app environment variables. Environment variables are great for local execution and secrets, but in a published Space they are quite static: changing a URL usually means editing settings, restarting, or redeploying. With the mapper in a dataset, LLM and TTS URLs become live data. I can change primary, fallback, timeout, or priority in the JSON at any time, and the app sees the new route on the next cache refresh and subsequent calls.
This was especially useful because endpoints change a lot during development: a local TTS tunnel, a remote Modal app, a personal LLM Space, then an organization Space, then a fallback. The game does not need to be taken down just because the infrastructure around it changed.
The lesson was that a public demo needs to be simple for judges, but real development needs a reliable local mode. Both modes need to be part of the architecture, not patched in at the last minute.
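To make the primary/fallback idea concrete, here is a minimal chain with per-endpoint cooldowns. It is not the code from inference_mapper.py: the `EndpointChain` class, the injected `send` callable, and the example URLs are assumptions, and the real layout of inference_map.json is not shown here.

```python
import time
from typing import Callable, Optional


class EndpointChain:
    """Try endpoints in priority order, skipping ones that failed recently."""

    def __init__(
        self,
        urls: list[str],
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.urls = urls
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.blocked_until: dict[str, float] = {}

    def call(self, send: Callable[[str], str]) -> Optional[str]:
        """Return the first successful response, or None if every endpoint fails."""
        for url in self.urls:
            if self.clock() < self.blocked_until.get(url, 0.0):
                continue  # still cooling down after a recent failure
            try:
                return send(url)
            except Exception:
                # Timeout, quota, or downtime: park this endpoint for a while.
                self.blocked_until[url] = self.clock() + self.cooldown_seconds
        return None
```

A `None` result is the signal for the local bot fallback or the fixed crisis quotes to take over. Rebuilding the chain from the mapper JSON on each cache refresh is what would make the URLs behave like live data.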
17. Fallback Is Not A Luxury, It Is Structure
The project became a system of fallbacks:
- LLM primary/fallback;
- TTS primary/fallback;
- local bot fallback;
- fixed crisis quote fallback;
- empty leaderboard when CSV/dataset does not exist;
- local or remote data;
- disabled TTS;
- quiet console with DOD_DISABLE_LOGS=True.
This may look like too much, but it is what made the app playable in real environments. External services fail, quotas run out, GPUs get saturated, local tunnels drop, remote endpoints sleep. The game needs to continue or at least fail in a way that makes sense.
18. Agent Traces Help Explain The AI Part
Late in the project I added one more piece that I wish I had built earlier: agent traces.
For a game like DOD, it is easy to say "the AI is playing", but it is much better to show what that means. So the app now writes lightweight JSONL records for the two places where the model is actually load-bearing:
- Nemotron turns, including the active card, crisis metrics, Nemotron's hand, which cards were legally playable, the raw model response, the parsed decision, whether fallback was used, and latency.
- IT Director reactions, including the played card, card effect, current crisis, recent Director lines, raw model response, generated bilingual quote, fallback status, and latency.
These traces are uploaded to a Hugging Face Dataset: build-small-hackathon/dod-agent-traces. The upload uses HF_TOKEN_DATASET, the same separation I used for private dataset access elsewhere, so it does not interfere with Hugging Face OAuth or local mock login.
I do not treat this as a full observability platform. It is intentionally small. But it gives judges and curious developers a way to inspect the model loop after a match: what the game sent, what the model answered, what the backend accepted, and when a safety fallback stepped in.
This also made the "AI is load-bearing" claim more concrete. Nemotron is not just mentioned in the README. Its decisions leave traces.
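A lightweight append-only writer is enough for this kind of trace. The sketch below assumes a record layout and an `append_trace` helper of my own; the field names only loosely follow the list above, and the upload to the dataset is omitted.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any


def append_trace(
    path: Path,
    kind: str,
    raw_response: str,
    parsed: Any,
    used_fallback: bool,
    latency_ms: float,
    **context: Any,
) -> dict:
    """Append one JSON record per line; extra keyword arguments become context fields."""
    record = {
        "ts": time.time(),
        "kind": kind,
        "raw_response": raw_response,
        "parsed": parsed,
        "used_fallback": used_fallback,
        "latency_ms": round(latency_ms, 1),
        **context,
    }
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# Usage example with a temporary file and made-up values.
with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "dod_agent_traces.jsonl"
    append_trace(trace_path, "bot_turn", '{"action": "PLAY"}', {"action": "PLAY"}, False, 812.4, playable=[2])
    append_trace(trace_path, "director_quote", "", None, True, 30000.0, crisis="corrupted database")
    lines = trace_path.read_text(encoding="utf-8").splitlines()
    kinds = [json.loads(line)["kind"] for line in lines]
```

JSONL keeps each record independent, so a crash in the middle of a match loses at most one line and the file can be synced as-is.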
19. Internationalization Needs To Start In The Architecture
The app is bilingual in English and Portuguese, but that required discipline:
- UI strings in dictionaries;
- cards with localized names and feedback;
- logs rendered by language;
- localized leaderboard;
- localized manual;
- localized OAuth button;
- global language selector;
- care with timers that update components and can accidentally revert text to English.
I chose to start with two languages for practical reasons. Portuguese is my native language, which let me evaluate whether the tone felt natural. English, on the other hand, is the broadest language for an international hackathon submission and works well for most of the audience.
But a truly multilingual game would need a more sophisticated architecture. Ideally, it would have separate prompts by language, few-shot examples tailored to each culture, and calls that respect the languages of the players in the match. In a table with players using different languages, it might be necessary to generate personalized lines per language, or even keep specialized model servers running in parallel to avoid literal translations or mixed styles from a single bilingual prompt.
That became a clear future improvement. The current version shows that PT/EN can work well when the scope is controlled, but it also made clear that AI internationalization is not just label translation: language generation needs to be designed for each audience.
This lesson came back repeatedly: i18n is not a final translation pass. It needs to be part of state design from early on.
20. Visual Polish Is Gameplay
Many small adjustments changed how playable the game felt:
- card text readable on every color;
- Res and Pan fitting without line breaks;
- draw pile using assets instead of a fixed emoji;
- mute/unmute working for player and spectator;
- lobby music looping and pausing during the match;
- buttons without progress spinners;
- manual with cards rendered in the same style as the game;
- global language selector moved so it does not overlap the log;
- visual breathing room between turns, so Director audio has time to land;
- steadier typography in headings and the leaderboard, removing spacing that made letters like A and V look uneven.
In a small game, UI is not cosmetic. If the player cannot read the card, they cannot understand the rule. If the toast repeats an old message, they think the game is bugged. If old audio plays, the match feels out of sync.
21. Match End Is A Whole-System Transition
Ending a game is not just game_started=False.
When the match ends, several things need to happen:
- stop active audio;
- discard pending TTS;
- clear old toast state;
- preserve Hugging Face identity;
- update leaderboard;
- start the countdown;
- rotate the queue;
- return the user to the lobby;
- avoid re-showing the manual name field if the user is still logged in;
- prepare the next match with Nemotron.
Several difficult bugs were born exactly there. Human victory, Nemotron victory, abandonment, panic game over, and timeout looked similar, but each had small side effects.
Lesson: match end is a global transition. It deserves its own flow, not just an if at the end of a move.
22. Server Logs Need UX Too
During development, detailed logs saved a lot of time. But for a demo, fallback, mapper, TTS, and connection logs turned into noise.
The solution was DOD_DISABLE_LOGS=True, while keeping:
- compact bot decisions;
- critical errors;
- silence for warmup, mapper, expected TTS issues, and operational fallbacks.
This does not affect the in-game match log. It only controls the server console.
Lesson: even logging needs a development mode and a presentation mode. Console output has UX too, weird as that sounds.
23. Sometimes The Product Also Requires Changing A Dependency
In parallel with the game, TTS opened an unexpected workstream: NanoVLLM/VoxCPM2 support.
During development, I noticed that the original project did not yet have adequate Windows support for the local path I needed. That mattered because one of the project goals was to allow local execution, not only a hosted demo. To run the pipeline on my development machine, I needed to adjust the original project and temporarily use a parallel branch until those changes are reviewed and, ideally, merged upstream.
Another point was seed-based generation. VoxCPM2 could already use a reference voice, but a small tone variation was still noticeable between generations. In a game, that variation can feel like character inconsistency: the Director sounds slightly different from one line to another. By adding seed support to voice generation, I could preserve the tone better and make the Director's presence more stable.
For both cases, Windows support and seed control, I opened PRs in the OpenBMB repository. Until those PRs are merged, the game temporarily uses my branch with those changes.
This was an important lesson: when an app depends on very recent model runtimes, the work does not stop at integration. Sometimes you need to go one layer down, fix or adapt the tool itself and, when possible, send that improvement back to the ecosystem.
24. What I Would Do Differently
Some things only became clear later:
- I would formalize lobby and queue state from day one.
- I would separate OAuth identity, player name, and match state earlier.
- I would create the endpoint mapper before depending on fixed URLs.
- I would add agent tracing earlier, because it helps debug and explain model behavior.
- I would define from the start which logs are for development and which are for operation.
- I would create the player manual earlier, because it helps reveal ambiguous rules.
- I would treat TTS as an asynchronous system with logical cancellation from the beginning, not as "download audio and play it".
25. Full Stack
The final stack ended up looking more like a small distributed system than a simple Gradio app.
Interface layer:
- Gradio 6;
- custom components with gr.HTML;
- browser JavaScript for board, audio, toasts, color picker, and reactive state;
- custom CSS for the arcade look;
- Hugging Face OAuth for authenticated identity.
Game layer:
- GameManager as the authoritative state;
- card rules, queue, lobby, turns, accusation, and victory in the backend;
- Gradio timers for synchronization;
- lobby with DOD_MAX_PLAYERS, DOD_MIN_PLAYERS_TO_START, and countdown;
- local or remote leaderboard.
AI layer:
- NVIDIA Nemotron Nano 4B for bot decisions;
- JSON-schema prompts for structured decisions;
- IT Director generated by LLM with card, crisis, and recent-quote context;
- local fallback for minimal moves when the endpoint fails;
- fixed crisis quote fallback when Director generation fails.
Voice layer:
- VoxCPM2 / NanoVLLM for TTS;
- Modal as an optional remote runtime for the TTS server;
- reference voice;
- seed for tone stability;
- separate TTS queue;
- stale audio discard through audio_generation_id;
- DOD_DISABLE_TTS=True option for development.
Infrastructure layer:
- Hugging Face Spaces for the public app;
- Hugging Face Datasets for inference mapper, leaderboard, and agent traces;
- inference_map.json as a dynamic endpoint router;
- append-only dod_agent_traces.jsonl synced to build-small-hackathon/dod-agent-traces;
- LLM and TTS warmup before match start;
- local data support with DOD_USE_LOCAL_DATA=True;
- local API support with DOD_USE_LOCAL_API=True;
- Modal as a remote TTS option;
- x-ip-token to preserve quota context in ZeroGPU Space-to-Space calls.
26. High-Level Architecture
The diagram below uses Mermaid. It renders directly in GitHub Markdown and helps show how the pieces talk to each other.
flowchart LR
Browser["1 Browser click"] --> Board["2 Custom Board"]
Board --> App["2 server_functions"]
subgraph Core["Game Core"]
App --> Manager["3 GameManager"]
Timers["Timers"] --> Manager
Manager --> Warmup["Warmup Gate"]
Manager --> State["7 Board State"]
end
State --> Board
Board --> Audio["7 Browser Audio Queue"]
Audio --> Browser
subgraph AI["AI Flow"]
Manager --> LLMQueue["4 llm_queue"]
LLMQueue --> Bot["8 Bot Decision"]
LLMQueue --> Director["4 Director Quote"]
Bot --> Predict["8 predict_llm"]
Director --> Predict["5 predict_llm"]
Predict --> Mapper["Inference Mapper"]
Warmup --> Mapper
Mapper --> LLM["LLM Endpoints"]
LLM --> GeneratedQuote["5 Generated Quote"]
GeneratedQuote --> Director
Bot --> BotFallback["Rule Fallback"]
Director --> QuoteFallback["5 Quote Fallback"]
end
subgraph TTS["Voice Flow"]
GeneratedQuote --> TTSQueue["6 tts_audio_queue"]
QuoteFallback --> TTSQueue
Warmup --> TTSEndpoints
TTSQueue --> TTSEndpoints["TTS Endpoints"]
TTSEndpoints --> AudioPayload["6 Generated Audio"]
AudioPayload --> PendingAudio["6 pending_audios"]
end
PendingAudio --> State
subgraph Data["Data & Identity"]
Manager --> Leaderboard["Leaderboard"]
Leaderboard --> Storage["CSV or HF Dataset"]
Manager --> Traces["Agent Traces JSONL"]
Traces --> TraceStorage["HF Trace Dataset"]
Mapper --> MapStorage["Local or HF mapper JSON"]
OAuth["HF OAuth"] --> App
App --> Token["x-ip-token"]
Token --> Predict
end
Short move flow:
- The player clicks a card on the custom board.
- JavaScript calls a Python function through server_functions.
- GameManager validates the rule, updates state, and records the event.
- If the card needs a Director reaction, a task enters llm_queue.
- The LLM generates the contextual Director quote; if it fails, the crisis fallback is used.
- The quote enters the TTS queue; if audio arrives in time, it is delivered per player.
- The next board poll receives state, log, and pending audio.
- If it is Nemotron's turn, another task enters llm_queue to decide the bot move.
- Nemotron decisions and Director reactions are appended to the trace JSONL and synced to the trace dataset.
27. Final Summary
DOD - Deploy or Draw taught me that a small AI game works best when the model does not try to be the whole game.
The LLM makes decisions inside a contract. The backend validates. TTS creates presence, but can be turned off. The leaderboard only trusts authenticated identity. Agent traces make the model behavior inspectable. The lobby is treated as part of the match. The app supports fallbacks because external services fail.
AI is essential to the experience, but it does not need to carry all the system reliability alone.
Maybe that is the main lesson I take from this: to build something fun with small models, the secret is not asking them to be huge. It is building a stage where they can shine without bringing the whole play down when they stumble.