Meet Puck

Community Article Published June 15, 2026

Screenshot 2026-06-16 at 12.36.30 AM

A Build Small entry — Thousand Token Wood (Creative). Play with it: the Space · Watch: 2-min demo

Puck is a small, mischievous creature that lives on your desktop. He roams, peeks at one little patch of whatever you're doing, and murmurs a single in-character line about it — then drifts on. He's not an assistant and not a notifier. He's company: marginally useful, reliably charming. Here are the notes from building him — mostly the places where the obvious approach was wrong and a small model (or a tiny trick) turned out to be the right one.

1. He started as a notification engine. That was a mistake.

The first version alerted on everything — build failures, mentions, mail, a finished agent run. It was useful and completely charmless, and it inundated you. A round of research into notification UX (the calm-technology and "ambient/peripheral display" literature) said the quiet part out loud: the value of a companion isn't in its alerts, it's in its presence. So the alert engine went dormant and Puck was rebuilt around one loop: roam → peek → quip. ~95% ambient, near-zero interruptions. He's "marginally useful" on purpose.

2. The VLM both sees and speaks — one call, not two.

The instinct is a pipeline: a vision model describes the patch → a text model writes a quip in character. Two calls, two models, more latency, more drift. Instead a single 12B VLM — Holotron-12B, H Company's computer-use model post-trained from NVIDIA's Nemotron-Nano-12B-VL — gets a system prompt that is Puck, and reacts to the image directly. The model that sees is the model that speaks — and the quip is grounded in pixels, not a lossy description.

3. OCR beats CLIP for "which coding agent is this?"

Puck should know whether you're in Claude Code, Codex, opencode, or pi. The hackathon-shaped answer is a CLIP fingerprinter: embed labeled screenshots, match by cosine. I built it… and it was fragile. On dark terminal screens with small text, every embedding clusters around 0.85–0.95 cosine — the margin between "this is Claude" and "this is Codex" was ~0.05, and a blank dark patch would confidently match something.

CLIP-ViT-B/32 at 224² simply can't read the text that distinguishes these tools. So I stopped asking it to. macOS ships a perfectly good on-device OCR (the Vision framework); a tiny Swift binary reads the prompt/status line and a keyword map nails the tool deterministically — 10/10 across all five CLIs, in ~0.25s. The discriminator was never the look; it was the words (gpt-5.5 xhigh, GLM-5.1, OpenCode, pi v0.78). And it's region-local — it reads the patch under the sprite, unlike a window title, which lies under tabbed terminals and browsers.

4. Getting emotions out of a small VLM: don't ask it to format, and don't trust the quip.

Each peek should carry an emotion that drives Puck's gesture, color, and voice. Two surprises:

  • The 12B won't reliably emit a structured tag. Asking for [amused] <line> got a lovely line and no tag, every time. Format-following is where small models are weakest.
  • The quip is a bad sentiment signalbecause Puck's voice is charming. Feed it rage-typing and it says "watching the terminal dance, wondering why the #@!…" — it understood the anger and then softened it. Classify that and you get "curious."

The fix: classify the emotion from the OCR'd screen text (where the real sentiment lives — ALL-CAPS, swearing, green checkmarks, a wall of tracebacks), as a separate one-word call. A single word is the one format a small model nails.

5. Camo: show what's behind, readably — not transparency, not blur.

Active camouflage: cloak Puck into the desktop. The first cut blurred and dimmed his body (frosted glass) — which looked cool and made the content behind him unreadable. The predator/thermoptic look is the opposite: the background shows through sharp, with just a shimmer + a refractive rim so you can tell a cloaked thing is there. The trick was dropping the blur entirely and using non-blurring filters (brightness/contrast/hue) for the shimmer.

6. The whole thing is small.

Nothing here is over 32B and it fits on a laptop: a 12B VLM for eyes-and-voice-of-the-fairy, an 88M CLIP fingerprinter, on-device OCR, and an 82M Kokoro neural voice running in the browser. Small models, composed, doing something that's mostly just… delightful.

7. Where this goes

The notification engine in §1 was a bad starting point — a bot that sprays you with every event hasn't earned the right to interrupt you. But "ambient companion" isn't the ceiling; it's the foundation you build trust on.

The future I see for Puck is an assistant that earns its usefulness. It watches (as it already does), but over time it learns what actually matters to you — then it does two things you'd otherwise do yourself: it bubbles up the few things worth your attention, and it handles the trivial ones (clicking, typing, submitting) so you can stay in flow. Not a louder notifier; a quieter one that's almost always right, plus a pair of hands for the busywork.

The engine for that is the part that looks like whimsy today: sleep. Every day's peeks land in the memory garden; at night Puck blooms them into a smaller, sharper sense of your world — and that distilled signal is exactly the dataset to fine-tune a model that knows your priorities, not a generic one's. Computer-use to act, learned relevance to decide, nightly fine-tuning to improve. He starts marginally useful on purpose, so that by the time he's useful for real, you already trust him.


Built with Hugging Face · Modal · Holotron-12B (post-trained from NVIDIA Nemotron). Try Puck: the Space.

Community

Sign up or log in to comment