You Are a Bug: A Hackathon Diary Post
Introduction
I was very excited to see the announcement for this hackathon for a few reasons. First, I have been tinkering at home with AI in my local environment: small LLMs, embeddings, creating simple RAG pipelines and experimenting with chunking strategies, etc. I thought my next toy project would be fine-tuning a small LLM for some kind of specific purpose, so this announcement was perfect timing.
Second, I am a content designer/conversation designer by day, and a D&D DM by night (well, as scheduling allows...), and the invitation to create an "LLDM" with a small model was too good to resist. What an amazing prompt.
Prologue: CONTEXT, the Game
I had been playing around with using Claude Sonnet to act as a DM in another toy project of mine some months back. I created an early proof of concept for a game called CONTEXT, in which you're an agency handler coordinating with secret agents out in the field.
Because LLMs can't do much, I thought I would lean into their text-in, text-out nature. So I made a game that takes place over email -- the UI is an email client. Under the hood, I had Sonnet construct agent personas by cobbling together characters at random from a series of tables: personality & manner of speaking, first name, nickname, nation of origin, cover story (e.g. bartender, pop star), and weaknesses (e.g. Alcoholic or Afraid of Snakes).
It's a little janky, and always crashes by the third game-day, but it's still pretty cool! I love how each agent speaks with a certain personality, and how Sonnet's deep knowledge means that agents can improvise and move around the world, even beyond the places I've set as starting locations. An agent in Karachi once snuck on the back of a truck and got off in Hyderabad, roughly 100 miles away. And his little agent card in the UI updated his location! I thought that was so sick -- true improv making its way into the story, and the UI as well.
Leaning Into Limitations
That experiment also taught me how bad LLMs are at randomizing (and therefore, procedural generation), state tracking, and arithmetic. I learned that the generation tables had to get shuffled in code -- an LLM would almost always make the starting location Austria, and there would always be an agent named Margeaux with the codename "Canary," etc. So I learned how to off-load tasks to code, which would allow the LLM to focus on what it does best: natural language narration, with a smidge of code generation to package up the outputs, too.
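Here is a minimal sketch of what "shuffling the tables in code" can look like. The table entries and the function names `roll_persona` and `persona_prompt` are made up for illustration (the post names the categories, not the contents); the point is only that the random picks happen in Python and the LLM receives them as fixed facts.

```python
import random

# Hypothetical tables: the categories come from the post, the entries do not.
TABLES = {
    "first_name": ["Ines", "Tobias", "Farah", "Kenji"],
    "nickname": ["Lantern", "Anvil", "Heron", "Static"],
    "origin": ["Pakistan", "Brazil", "Kenya", "Norway"],
    "cover_story": ["bartender", "pop star", "florist", "tour guide"],
    "weakness": ["afraid of snakes", "chronically late", "cannot resist a dare"],
}


def roll_persona(rng: random.Random) -> dict:
    """Pick one entry per table in code, so the LLM never has to 'randomize'."""
    return {field: rng.choice(options) for field, options in TABLES.items()}


def persona_prompt(persona: dict) -> str:
    """Hand the picks to the model as settled facts to narrate around."""
    facts = "\n".join(f"- {field}: {value}" for field, value in persona.items())
    return f"Write this agent's first email. Their details are fixed:\n{facts}"


persona = roll_persona(random.Random(7))  # seed only for reproducibility
print(persona_prompt(persona))
```

Passing in a `random.Random` instance instead of using the module-level functions keeps each game reproducible from a seed, which helps when debugging a run that went sideways.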
In my own work-work, I've created AI-powered Figma plugins and other "magical UI" experiences where an LLM's response isn't just a block of prose -- it's a JSON response with an answer inside, plus other types of data and flags which would paint or modify a UI surface. So instead of a chatbot, you end up with a UX surface that answers you with updates, metadata, and perhaps even animations, based on simple code values that the LLM generates. I wanted to explore this more with smaller models! (And I'll note, it's been very cool to see the other hackathon entries that went down this path as well.)
This post would get very long if I gave my whole spiel on why small models are the future. I've always wanted to try my hand at simple fine-tuning to make models that perform small tool calls and produce useful JSON. So I was thrilled to see this Hackathon's Thousand Token Wood prompt. My LLM would be terrible at math and state tracking, but I had already accepted that and moved past it.
Decisions, Decisions
So, so much time was spent on design decisions. As I did my research, it became clear just how much game design, system design, backend design, and model design are all a jumbled mess with no clear boundaries. You really had to be able to close your eyes, still your mind, and see all the inputs, schemas, outputs, code hooks, buttons and drop-downs all at once.
From the get-go, I was imagining a TTRPG with a small LLM as the narrative engine. I knew I wanted a sheet-like experience with stats, HP, etc., and actual stakes like player death and Game Overs. And where there are stats, there are rolls.
My initial vision was way, way over-engineered. Yes, I knew I would use code to do simple math and pass in randomized world seeds for each game-day, but I was still imagining things like character sheet mutations (granting consumable items, and items/feats/boons with passive bonuses) and even pre-generating bugs based on inputs.
As I learned about simple fine-tuning and thought about the full E2E pipeline across the program's layers (prompts, the LLM itself, code, UI), I started to understand that I needed to pare down even further.
Giving the player a reward means changing the sheet. So does deducting HP, or using a potion to add HP. That's a whole new tool-call to build, and even worse, a whole new input and output shape to train for. I went in thinking about all the amazing inventory and boon shenanigans I could get into, but ultimately I realized I had to eat my most cherished larva for this project to survive. I drew a line in the sand: the LLM would narrate only. No direct sheet mutations of any kind.
Taking Shape
So, the model was trained to produce only two shapes, and decide when to use them: either a JSON response with a simple say that narrates something, or a roll that tests a stat (might|speed|smarts|mystique) with a difficulty (easy|medium|hard). What comes in as input is a simple card in a set format to provide context, with a standard shape to give structure to our inputs for fine-tuning:
Input: The player, world, previous turns, and most recent user input
PLAYER
Luna Moth, Level 1 — HP 4/4 · Moxie 5/5
Might 1 · Speed 1 · Smarts 6 · Mystique 2
A wise wizard type. Worldly and mysterious.
THE WORLD
Night, high summer. Warm dark thick with pollen. Restless and electric — the night shift coming awake.
WHAT JUST HAPPENED
[dm] The night is warm and buzzing. Creatures of the night start to hum and chitter as they join the forest's festivities, replacing day-timers.
[player] i fly up to the tree canopy and try to find blooms to drink from
[dm] A pulse of sound brushes your wings, then another, faster. Out of the warm dark a bat banks toward you, painting you with clicks. You are the loudest thing in its black sky, and it is hungry.
PLAYER'S TURN
[player] I fold my wings and drop like a leaf.
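For illustration, a card in this layout could be assembled from game state roughly like this. The `Player` dataclass and `build_card` function are hypothetical names, not the game's actual code; only the section headers and line format are taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class Player:
    """Hypothetical container for the sheet values shown on the card."""
    name: str
    level: int
    hp: int
    max_hp: int
    moxie: int
    max_moxie: int
    stats: dict  # insertion order is the display order
    blurb: str


def build_card(player: Player, world: str, history: list, player_input: str) -> str:
    """Render the fixed-format context card the model is fine-tuned on."""
    stat_line = " · ".join(f"{k.capitalize()} {v}" for k, v in player.stats.items())
    lines = [
        "PLAYER",
        f"{player.name}, Level {player.level} — HP {player.hp}/{player.max_hp}"
        f" · Moxie {player.moxie}/{player.max_moxie}",
        stat_line,
        player.blurb,
        "THE WORLD",
        world,
        "WHAT JUST HAPPENED",
        *[f"[{role}] {text}" for role, text in history],  # (role, text) pairs
        "PLAYER'S TURN",
        f"[player] {player_input}",
    ]
    return "\n".join(lines)


luna = Player("Luna Moth", 1, 4, 4, 5, 5,
              {"might": 1, "speed": 1, "smarts": 6, "mystique": 2},
              "A wise wizard type. Worldly and mysterious.")
card = build_card(luna, "Night, high summer.",
                  [("dm", "The night is warm and buzzing.")],
                  "I fold my wings and drop like a leaf.")
print(card)
```

Using one function for both dataset generation and live play would help keep training inputs and runtime inputs byte-for-byte consistent.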
Output: A JSON roll response, testing a skill
{
  "action": "roll",
  "stat": "speed",
  "difficulty": "medium",
  "on_success": "You tumble out of the sound-beam, falling all wrong on purpose, and the bat's swoop slices the air where a sensible moth would have been.",
  "on_fail": "You drop a half-beat late. Wing membrane brushes you, claws comb your tails, and you spin earthward, listening to the bat already turning back around."
}
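Since the model only ever produces two shapes, the code side can check each response before trusting it. The sketch below is one way to do that, not the game's actual validator: the roll fields match the example above, but the `text` field on a say response is an assumed name, since the post does not show that shape.

```python
import json

STATS = {"might", "speed", "smarts", "mystique"}
DIFFICULTIES = {"easy", "medium", "hard"}


def parse_dm_output(raw: str) -> dict:
    """Parse a model response and reject anything that is not a valid say or roll."""
    data = json.loads(raw)  # malformed JSON raises JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    action = data.get("action")
    if action == "say":
        required = ("text",)  # ASSUMPTION: hypothetical field name
    elif action == "roll":
        required = ("on_success", "on_fail")
        if data.get("stat") not in STATS:
            raise ValueError(f"unknown stat: {data.get('stat')!r}")
        if data.get("difficulty") not in DIFFICULTIES:
            raise ValueError(f"unknown difficulty: {data.get('difficulty')!r}")
    else:
        raise ValueError(f"unknown action: {action!r}")

    for key in required:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")
    return data


raw = ('{"action": "roll", "stat": "speed", "difficulty": "medium", '
       '"on_success": "You tumble clear.", "on_fail": "You drop too late."}')
print(parse_dm_output(raw)["stat"])
```

Because every failure surfaces as a `ValueError`, the caller can catch one exception type and decide whether to retry the generation or show an in-character error message.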
"Wait, I see both the success and the fail here. Why?" Great question! Remember when I said I was taking all the math off the LLM's plate? That includes resolving skill checks! The LLM narrates what both the success and the failure would look like in a single response, then the code takes the player's stat, rolls a random number against it, resolves which happens, and displays the winner. Then that winning response becomes the canonical turn in the turn history, too. Kinda clever, huh? I hope so!
"But wait -- what's the deal with these Abilities? How do they work?" Another fabulous query. Abilities are teed-up and then fired with the user's input. They both get handed off to the LLM...
[player] i preside over the assembly and call it to order, as is my right!
[used ability: Heraldry — Your reputation precedes you. You succeed at any non-deceptive social roll, such as persuasion or intimidation.]
...and then the LLM just accepts it as an auto-win. In fact, every ability is just a fiat "This happens successfully." That's why they cost Moxie. On the backend, the LLM has been trained to accept them and output the result as a definite say response. No roll required.
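Both mechanics, code-resolved rolls and Moxie-priced abilities, can be sketched in a few lines. Everything numeric here is an assumption: the post does not say what die is rolled, how difficulty shifts the odds, or what an ability costs, so the roll-under d10, the `DIFFICULTY_BONUS` values, and the `moxie_cost` field are placeholders.

```python
import random
from dataclasses import dataclass

# ASSUMPTION: difficulty widens or narrows the target; real values are unknown.
DIFFICULTY_BONUS = {"easy": 4, "medium": 2, "hard": 0}


@dataclass
class Ability:
    name: str
    description: str
    moxie_cost: int  # hypothetical; the post only says abilities cost Moxie


def resolve_roll(roll: dict, stats: dict, rng) -> dict:
    """Choose on_success or on_fail in code; the model never does the math."""
    target = stats[roll["stat"]] + DIFFICULTY_BONUS[roll["difficulty"]]
    die = rng.randint(1, 10)  # ASSUMPTION: roll-under on a d10
    success = die <= target
    text = roll["on_success"] if success else roll["on_fail"]
    # Only the winning branch becomes the canonical turn in the history.
    return {"success": success, "history_line": f"[dm] {text}"}


def ability_turn(player_input: str, ability: Ability, moxie: int):
    """Spend Moxie and build the two lines handed to the model as the player's turn."""
    if ability.moxie_cost > moxie:
        raise ValueError("not enough Moxie")
    turn = (f"[player] {player_input}\n"
            f"[used ability: {ability.name} — {ability.description}]")
    return turn, moxie - ability.moxie_cost


roll = {"stat": "speed", "difficulty": "medium",
        "on_success": "You tumble clear.", "on_fail": "You drop too late."}
print(resolve_roll(roll, {"speed": 1}, random.Random()))
```

Taking the random source as a parameter means a test can pass in a stub with a fixed `randint`, so both branches of a roll can be checked without relying on luck.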
Model Selection: Take 1
This part took a while, as I tried out a ton of small-B models. At first, Llama-3.2-3B-Instruct blew everything else away. In my local tests, it had great stylistic narration, output decent JSON responses just from prompting, even did a few roll outputs correctly on its own! And it's small-small. It was hard to beat!
Model testing also meant designing the system prompt in parallel, so that was evolving at the same time. Once my system prompt was locked, I was ready to start on my actual dataset...
Datasets are Hard
This whole hackathon was a crash course, but getting my dataset together was the crashiest part. I have a full-time job and two small kids, so I knew I couldn't spend hours upon hours hand-authoring 500+ player input cards and output responses. A few months back, I was experimenting with batch APIs, because I thought it was cool that you could get a 50% discount on your requests if you're willing to get them back at "email speed". I also appreciated how they made batching more graceful for repeat or in-parallel work. So I made a CLI tool called Penpal that handles all the batching for you, from right within your coding assistant of choice. So I went that route, and was mostly pleased. Each session's worth of data (which would get sliced into 3-7 individual turns) cost just under 1¢ USD.
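The post does not show how a generated session gets sliced into turns, so treat the following as one plausible approach rather than the actual pipeline: every DM reply that follows a player move becomes its own training example, with everything before that move as history. The `role`/`text` dictionaries and the `slice_session` name are assumptions for this sketch.

```python
def slice_session(turns: list) -> list:
    """Turn one generated session into one training example per DM reply."""
    examples = []
    for i in range(1, len(turns)):
        reply, prompt = turns[i], turns[i - 1]
        # Only a DM reply that directly answers a player move is a target.
        if reply["role"] != "dm" or prompt["role"] != "player":
            continue
        examples.append({
            "history": turns[: i - 1],       # goes under WHAT JUST HAPPENED
            "player_input": prompt["text"],  # goes under PLAYER'S TURN
            "target": reply["text"],         # what the model should output
        })
    return examples


session = [
    {"role": "dm", "text": "The night is warm and buzzing."},
    {"role": "player", "text": "i fly up to the tree canopy"},
    {"role": "dm", "text": "A bat banks toward you."},
    {"role": "player", "text": "I fold my wings and drop like a leaf."},
    {"role": "dm", "text": "You tumble clear."},
]
examples = slice_session(session)
print(len(examples))
```

In a real dataset the `target` would be the raw say or roll JSON rather than plain narration; plain strings keep the sketch short.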
I also had to create a review harness for hand-approving all the pairs. I couldn't write them, but I wanted to hand-approve them to ensure I was getting the right mix of scenarios, that roll turns were getting triggered and answered correctly, and that the style was something functional and special (although I don't love how the prose style turned out... more on that later).
Using Modal.com to fine-tune was really fun, and shockingly easy from the CLI. I had always heard that fine-tunes were an iterative process, and now I get it. As of this writing, I am on my sixth training session, fingers crossed. Despite the ease of this part, all the data wrangling was a serious pain.
I get it now! I understand why building, sorting, cleaning, and pruning datasets for ML is kinda-sorta the hardest part! At least in 2026. I learned a TON, and I would do it again, but I would absolutely do things differently next time.
The Actual Writing Was a Blast
This is the least technical part, but I did just want to note how much fun I had authoring all the stuff you see in the game's UI. That was 100% Josh. The 12 pre-generated bugs, their respective schticks, their abilities, and even the loading states and error messages were really fun to make. I love RPGs and DMing, and I love content design, but I had never done content design for a game before. This may be something I pursue further!
Model Selection: Take 2
Once I had my finished app out in the world, and all my friends playtesting it, I realized that 3B -- even at f16 -- wasn't cutting it. I really, really wanted to score the Tiny Titan badge and come in under 4B, but the performance just wasn't there. Just barely.
It was a tough call, but with my pipeline already set up and everything ready to go, it didn't take long to do the same fine-tune and merge against Qwen3-8B. The apples-to-apples comparison was hard to turn down, and the app was already pretty snappy (after the first turn...) on ZeroGPU. At Q8, it works much better. Check out the model and its adapter here, and let me know how you use it! Example trace included. I'm bummed I couldn't go smaller, but I wanted to deliver something fun, and only buggy in a good way 🪲
Thanks, and Lessons Learned
OK, I am running out of time. First, I want to thank the Gradio and Hugging Face teams, as well as the sponsors, for putting together such a cool hackathon! I also want to thank all the other participants who asked great questions in the Discord, and all the HF and Gradio staff who were gracious enough to answer them. I have been exploring other participants' spaces during breaks this week, and I've been blown away by all the wild stuff I'm seeing.
Things I would do differently next time:
- Pay attention to style more: The bulk model did a great job creating candidates, but this sort of annoying high poetic style ("the spring rain's gentle hush, the stream roaring like a drumline" --- blech!) started creeping into the dataset, and by the time I realized it, it was too late to start over. So I would keep an eye on my candidates more, and tweak the meta-generation prompt a bit before committing.
- Take more time to build my candidate pipeline: I was running all over trying to build (and repair!) an entire factory while also getting product out the door. It felt crazy. But now I know!
- Put more deep thinking into distribution of scenarios: I learned this eventually, but a little later than I should have. I thought I needed perhaps 8 or so scenarios to cover, but it was more like 19. I ended up generating more to provide a more "full spectrum" dataset as I went along, but as with the point above, I would have had a nicer and more efficient time if I could have seen around corners better and really thought about all the many different types of player turns there could be, and how I would want the game to respond in those cases.