The Rizz Therapy Show

Published June 11, 2026

HuggingFace and Gradio launched their Build Small Hackathon recently, with a beautiful constraint: Utilizing models that have only up to 32 billion parameters.

Space: Rizz Therapy

The hackathon featured two tracks:

Backyard AI: I had to solve a real problem for someone I know.
An Adventure in Thousand Token Wood: I had to build something delightful, weird and whimsical that wouldn't exist with AI.

I instantly knew I had to do the second track, because I love building things that are fun and entertaining. So I decided to build a Rizz Therapy show, where two AIs come with rizz problems, and a rizz therapist guides them and improves their rizz.

I had never used ZeroGPU, so I had to try it too. So I looked into how it works, where it excels at, what are its constraints etc.

The thing about ZeroGPU is that you don't get billed for usage of users, no matter if 10 people use it or 1000. They have a specific amount of ZeroGPU quota available to them depending on authentication and subscription (2 mins/day for unauthenticated, 5 mins/day for Free, 40 mins/day for Pro).

BTW that's an RTX Pro 6000. Are you kidding me HuggingFace? Why are you giving THAT BEAST of a graphics card for this many minutes EVERY DAY!? Just WOW. You guys are awesome. Please don't change this scheme lol.

What I wanted

Bilingual: It should support both English and Hindi languages. The LLM should be very creative and prompt adhering in both.
Instantly Shareable: Episodes should be instantly shareable with a custom watch link.
Blazing Fast: It should be fast, targeting full episode generation (script writing, audio generation and forced alignment) all under 1 minute.
Expressive Audio: The TTS should be reaching State of the Art in both languages.

The Hunt for the Best Models

Large Language Model

Nothing beats Google DeepMind's models in creative and multilingual outputs. But Nvidia came promoting their Nemotron models. So I tried Nemotron 3 Nano 30B A3B, but it couldn't match my expectations on the comedic and structured writing even in English, let alone in Hindi.

So I chose Gemma 4 26b a4b. Because of its 4b active params, this little thing runs at monster speed specially at lower bit quants like Q4_K_M which is what I used.

But to get that speed I had to use llama.cpp, and running it on ZeroGPU is not easy. You can't initialize the full model in global context as it will get stored in RAM. And when ZeroGPU would run, unlike torch, it won't even use the GPU. So I initialized the full model inside the @spaces.GPU function. So every time a GPU function runs, it loads the model from disk which is not that slow (~5 seconds), given that we solved the biggest roadblock of our project.

Now Gemma was running at around 110 tok/sec on ZeroGPU, and was writing very good scripts in structured json. BTW the speed is faster than all HF inference providers providing this model, and every provider on openrouter.

Text To Speech

I was already using OmniVoice in my other projects, as it's insanely fast and good given it's just 0.6b params. But again, OpenBMB came as a sponsor, promoting their models including VoxCPM2 TTS which is also multilingual, but it's 2B params. (~3.33x larger than OmniVoice), which meant it ran a lot slower than OmniVoice. I just couldn't compromise on speed. That's the whole point of optimizations. I needed my project to be usable even by friends, who don't have an HF account.

The difference between a Space which uses 1 min of ZeroGPU and one which uses 3 min, is a LOT, it's like they can create episodes and share it with others or they can't even run it to begin with, because of 2 min unauthenticated quota.

Forced Alignment

I obviously had to do lipsync on characters, and change their gazes dynamically mid sentences. The dialogues are like: Okay Mike, <look_at:mike> I think you aren't getting it. She <look_at:mia> just wants to watch a horror movie with you. That's all. I could do lipsync at runtime on Unity side (but it would have used too much CPU on client browser at runtime), and have removed look_at tags all together. But it just feels so good, when the characters change their gazes mid sentence to look at each other.

So I used mms-1b-all for CTC as it supports a ton of languages. It takes audio and outputs character probabilities frame by frame. Forced alignment is performed on those outputs to get character and word level timings for the dialogues, and the characters are mapped to visemes for lipsync.

Architecture

Here is a quick architecture diagram I made in excalidraw (excellent for diagrams btw, and open source).

Costumes. Yay!!

Okay but having only 1 costume per character is boring. Isn't it? So I proceeded by setting up 15 different costumes. Here are few of them which I love.

The Result

Now we have the full pipeline which generates a full episode in under 1 minute, instantly shareable, high fidelity audio, expressions and gazes just as we wanted. And it supports both Hindi and English. For me, seeing Rahul and Neha get help on their rizz problems is just other worldly.

This would not have been possible even with a pricey subscription one year ago. So I'm insanely grateful for Google Deepmind for their Gemma 4, k2-fsa for their OmniVoice TTS and HuggingFace for ZeroGPU.

Some other Problems

Structured Output Formats

Requesting structured output with a json schema utilizes Grammar checks using the CPU on every token, which throttles the GPU, bringing the speed down to ~20 tok/sec. So I couldn't use structured outputs. Lucky for me, Gemma 4 26b a4b is that good, that it can output correct json most of the time. So we tuned our system prompt, and passed the raw LLM output to json.loads().

But this method wasn't good enough. Sometimes it missed double quotes in the json or included an extra space in keys, ruining the entire script. I got to know about json-repair, now it was repairing the output json with a schema. Now the script generation was lot more robust.

Outbound network bandwidth

Unity WebGL build was around ~130MB, but every time a user was visiting the space, the server was sweating and was too slow to serve those files. Also each episode came around as ~6MB, which was also becoming a problem, when a new friend visited the episode to watch, the server had to serve this.

At one point, the speed was down to kbps, so I switched to Cloudflare R2 for Unity build and episodes distribution. In offline mode, it uses your local disk to serve as the repo ships with the Unity builds needed, and the server mounts the required directories.

15 Costumes

WebGL loads everything in RAM, so when my Unity build shipped with 15 VRoid characters, it OOMed out every time. First I reduced to only 9 costumes, which was still hitting the limit of stable WebGL RAM recommendations (around 2 GB). I needed it to stay around ~1GB RAM.

This way, I got to know about Unity addressables, which lets me host my remote bundles anywhere (local server or R2 in our case), and Unity can load them at runtime when needed. Now I could do as many characters as I want and it would use the same amount of RAM, as each episode only needs 3 characters.

Thank you HuggingFace and Gradio. I came in wanting to build a rizz show, and I'm walking away knowing a lot more about model optimizations for ZeroGPU, Unity Addressables and pipeline engineering than I did a week ago.

Models mentioned in this article 4

Spaces mentioned in this article 1

Signal Garden: A Game Engine That Keeps Mutating

June 16, 2026

Noteworthy

June 15, 2026

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote