Viva Mais AI
Local-first WhatsApp copilot for the Viva Mais travel agency
A lot of AI tools aim to help everyone. "Upload anything, ask anything." It is an exciting pitch, and I wanted to try the other direction. What is the smallest, most specific thing I could build that gives one real person real value, something they would actually use every day?
The person is Lea, who runs a small travel agency called Viva Mais Turismo. She sells airline tickets and books trips, and she runs her entire business through WhatsApp. No Notion, no spreadsheets, no CRM. She negotiates, sends quotes, collects payment receipts, chases passenger documents, and answers customers late at night, all in one chat app. It is messy and it works for her. The one thing she hates is repetitive data entry, and she is the only employee, so there is no one to hand it to.
So the question got smaller and sharper. Can open-source, local, offline AI pull the real business out of her messages (the payments, the trips, the next thing she has to do) without her changing how she works and without her client data leaving her computer?
That is the whole project. Build something so small and so specific that it helps one real person, and let that be enough.
She exports a WhatsApp conversation, the normal zip with text, images, and voice notes, and uploads it. The app reads everything and builds a CRM card for that customer:
Everything runs on small models, in process, on the machine. No external API. Client data never leaves. For a business built on private conversations, privacy is the whole point of the design, and everything else has to serve it.
This list looks obvious now, but it was not obvious to me at the start. My first version solved the wrong problem, because I had not interviewed her properly. I assumed I knew what mattered, mapped the data to my own guesses, and built records for things she did not care about while missing the ones she lived by. More code could not fix that. I had to go back and ask better questions in a second interview. Only then did the real shape of her business show up: quotes turning into reservations, receipts chasing a balance, documents blocking ticketing. Getting the domain wrong first, and being willing to throw it out and re-interview, was one of the most useful failures of the whole project.
Every model is well under the hackathon size cap. The biggest one in the app is 1.3B. This qualifies as a "Tiny Titan" build almost by accident: I was not chasing the category, small was simply the point from the start.
I built this mostly by directing coding agents, and I used different ones for different work. For most of the app I used Claude, plus a combination of DeepSeek and Grok. For the fine-tuning work I used Codex with GPT-5.5 at x-high reasoning, because that work is long, fiddly, and unforgiving of a careless step.
The thing I actually learned is that the limit was me. I work as an AI engineer, not a researcher, and here that gap mattered. The agents could go a long way, but only as far as I could actually steer them. On the app, where I understood the problem well, they flew. On the fine-tuning, where I did not, they would confidently do the wrong thing and I could not tell, because I did not yet know enough to catch it. The agents did not save me from learning. They raised the ceiling on what I could build, but the floor was still my own understanding. So I had to go and actually learn how fine-tuning works (the data, the failure modes, the evaluation) before the agents were any use on it at all.
For the model training work I leaned on a separate planning and execution loop, with every run written down in a log as it happened. That running log is the reason I can write this post honestly, because the failures are all there with dates and numbers.
I had not really fine-tuned models before this. The two models went very differently.
My first instinct was to force the vision model to emit structured JSON directly from each image. It failed, and at first I blamed the model. I was wrong to. The images are wildly inconsistent: phone photos of screens, glare, crops, different receipt layouts. Asking one small model to both read a noisy photo and produce perfect typed fields in one shot was simply too much to demand at once.
So I split the job. One stage transcribes and classifies the image. A second stage takes that clean text and produces the typed fields, with the model behind a clean interface so I never parse anything with brittle regex. I also made the tables editable, so when the model is wrong, she just fixes the cell. That small decision matters more than it looks: the product can be wrong sometimes, as long as a human can correct it in seconds.
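The split described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the names `Transcript`, `PaymentFields`, `parse_brl`, and `run_pipeline` are hypothetical, and the two model stages are passed in as plain callables so that any local model could sit behind them.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Optional, Tuple


@dataclass
class Transcript:
    """Output of stage one: what kind of image it is, and its text."""
    kind: str  # hypothetical labels, e.g. "receipt", "ticket", "other"
    text: str


@dataclass
class PaymentFields:
    """Output of stage two: typed fields the dashboard can display and edit."""
    amount: Optional[Decimal]
    payer: Optional[str]


def parse_brl(value: str) -> Decimal:
    """Convert a Brazilian currency string such as 'R$ 1.234,56' to a Decimal."""
    cleaned = value.replace("R$", "").strip()
    # Brazilian format: '.' separates thousands, ',' marks the decimals.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


def run_pipeline(
    image: bytes,
    transcribe: Callable[[bytes], Transcript],
    extract: Callable[[str], dict],
) -> Tuple[Transcript, Optional[PaymentFields]]:
    """Stage one reads the image; stage two only ever sees clean text."""
    transcript = transcribe(image)
    if transcript.kind != "receipt":
        # Nothing to type for non-receipts in this sketch.
        return transcript, None
    raw = extract(transcript.text)  # assumed to return a dict of strings
    amount = parse_brl(raw["amount"]) if raw.get("amount") else None
    return transcript, PaymentFields(amount=amount, payer=raw.get("payer"))
```

Keeping the stages behind callables means either model can be swapped or retrained without touching the other, and a missing field stays `None` for a human to fill in rather than being guessed.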
The real fight was accents. Brazilian Portuguese lives on its accents (não, São, conexão, R$ 1.234,56), and a de-accented output is wrong and looks careless. I found three separate copies of a glossary in my own code that had quietly dropped the accents, which means I was teaching the model to drop them too. I also learned the hard way that some image fonts silently throw away accent marks when rendering, so my synthetic training data was full of accent-less text that no amount of training could fix. Fixing the data at the source was most of the battle.
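A check like the one below would have caught the de-accented glossaries and the accent-less synthetic text earlier. It is a sketch under assumptions: `strip_accents` and `lost_accents` are hypothetical helper names, and it only flags a term when the bare form appears and the accented form does not.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks, so 'conexão' becomes 'conexao'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lost_accents(glossary, text):
    """Return glossary terms that show up in `text` only in de-accented form."""
    # Normalize to NFC first so accented letters are single word characters.
    normalized = unicodedata.normalize("NFC", text)
    tokens = {t.lower() for t in re.findall(r"\w+", normalized)}
    flagged = []
    for term in glossary:
        accented = unicodedata.normalize("NFC", term).lower()
        bare = strip_accents(accented)
        if bare != accented and bare in tokens and accented not in tokens:
            flagged.append(term)
    return flagged
```

Run against a glossary file, a batch of rendered training labels, or model output, an empty result means no known term lost its accents; any flagged term points at data to fix at the source.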
I then built a distillation pipeline: a larger Portuguese vision model labeled real images, and the small MiniCPM-V student learned from those labels plus fixed synthetic data. There were plenty of small disasters along the way: a reasoning teacher that would not stop thinking and never emitted clean JSON until I forced it, PDFs that could not be opened as images, and a memorable moment where I was convinced a training run had died at 6 percent when in fact it had finished perfectly and I had been looking at the logs of a different, crashed run.
Here is the part that genuinely amazed me. A simple fine-tune, on mostly synthetic data, on an OpenBMB vision model with barely more than a billion parameters, just worked. This is a tiny model. I rendered my own training documents, fixed the accents at the source, and that alone moved the needle before I added any real data at all. On my evaluation set, comparing the base model to my fine-tune:
The fine-tune beat the base model on every metric I measured. The evaluation set overlaps the training data and the labels came from a model rather than a human, so I trust the direction more than the exact decimals, but the direction is not subtle and it shows in the app. A model this small, taught with data I generated myself, reads Lea's receipts and tickets well. That still feels like a small miracle to me.
The text model was humbling in a way the vision model was not.
The job sounds easy: answer questions about one customer's dashboard. "How much does she still owe?" "What is the flight number?" In practice a 1B model on this task has two failure modes that are exactly the ones you cannot ship. It leaks, meaning it answers about the wrong customer or invents a value that is not there. And it over-answers, meaning when the honest reply is "that is not in the conversation," it makes something up anyway.
I trained candidate after candidate: 001, 002, then a v1, v2, v3, and v4 line. Most of them I rejected with my own evaluation gate. Some beat the base model on average score and still failed, because a higher average is not good enough when the model confidently tells you another client paid R$ 1.800 when they paid R$ 1.750. I learned to stop trusting a single number that went up. I expanded my test set from 32 cases to 158, added categories for leakage and refusal specifically, and only then could I see clearly that my "better" models were often trading one kind of error for another.
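The idea of gating on each category rather than on one average can be sketched like this. The category names, the floors, and the helper names (`CaseResult`, `pass_rates`, `gate`) are assumptions made for illustration, not the thresholds or code used in the project.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CaseResult:
    category: str  # hypothetical categories: "lookup", "leakage", "refusal"
    passed: bool


def pass_rates(results):
    """Pass rate per category, so one strong category cannot hide a weak one."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r.category] += 1
        passes[r.category] += int(r.passed)
    return {cat: passes[cat] / totals[cat] for cat in totals}


def gate(candidate, baseline, floors):
    """Accept a candidate only if every category clears its floor
    and does not regress against the baseline model."""
    cand = pass_rates(candidate)
    base = pass_rates(baseline)
    failures = []
    for category, floor in floors.items():
        rate = cand.get(category, 0.0)
        if rate < floor:
            failures.append(f"{category}: {rate:.2f} below floor {floor:.2f}")
        if rate < base.get(category, 0.0):
            failures.append(f"{category}: regressed against baseline")
    return (not failures), failures
```

With a gate like this, a candidate that answers more lookups correctly but leaks once more than the base model is rejected even though its overall average is higher.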
I also had to know when to stop. At one point I tried to use a very large 397B Portuguese model as the teacher. It would not fit: it ran out of memory even across eight H100 GPUs, and the cost to push further was not worth it for one decision. I switched to a 4B teacher instead. Picking the smaller, cheaper, good-enough tool was the right call, and it is very much in the spirit of this hackathon.
Where it landed: the deployed text model is better than the base model at the dashboard questions and noticeably better at Portuguese. It still leaks and over-answers more than I want, and it does not fully pass my strictest gate. I think that is the right thing to say out loud, because the gate exists exactly so I do not fool myself. This is a hard task for a one-billion-parameter model, and it earned its place by being a real improvement that says "I do not know" more often than the model it replaced.
What works:
What does not, yet:
I built one tool for one person. It takes Lea's messy WhatsApp history and hands her back the business that was buried in it, on her own machine, using models small enough to run there.
And it worked for her. Lea got real, valuable data out of Viva Mais Turismo's own conversations: who paid, who still owes, where each trip stands, what to do next. That data came out of incredibly small open source models, running locally, and fully offline if she wants it that way. Her client conversations never have to leave her computer to become useful. She was satisfied with what it gave her, which is the only review of this project that actually counts.
That is what "build small" means to me. I mean a real solution scoped down to the size of one person's actual life, running on models small enough to respect their privacy, rather than a shrunken copy of some big idea. It helped Lea, so it was worth building, and that is the whole point.