Viva Mais AI: 1B models for one person

Community Article
Published June 15, 2026

This is my Build Small Hackathon project: Viva Mais AI, a local WhatsApp copilot for a travel agent. I am writing it the way it actually happened, including the parts that did not work. If you only have a minute: I tried to build the smallest, most specific thing I could, for one real person, and run it entirely on tiny models on her own machine.

The idea: stop building for everyone

A lot of AI tools aim to help everyone. "Upload anything, ask anything." It is an exciting pitch, and I wanted to try the other direction. What is the smallest, most specific thing I could build that gives one real person real value, something they would actually use every day?

The person is Lea, who runs a small travel agency called Viva Mais Turismo. She sells airline tickets and books trips, and she runs her entire business through WhatsApp. No Notion, no spreadsheets, no CRM. She negotiates, sends quotes, collects payment receipts, chases passenger documents, and answers customers late at night, all in one chat app. It is messy and it works for her. The one thing she hates is repetitive data entry, and she is the only employee, so there is no one to hand it to.

So the question got smaller and sharper. Can open source, local, offline AI pull the real business out of her messages (the payments, the trips, the next thing she has to do) without her changing how she works and without her client data leaving her computer?

That is the whole project. Build something so small and so specific that it helps one real person, and let that be enough.

What it does

She exports a WhatsApp conversation, the normal zip with text, images, and voice notes, and uploads it. The app reads everything and builds a CRM card for that customer:

  • Every image and PDF becomes one typed record. The model decides what it is: an air quote, a flight reservation, a boarding pass, a payment receipt, a sale invoice, a hotel booking, an identity document, or a plain chat screenshot. Anything it cannot place is kept as is, with its full transcription, not silently dropped.
  • Voice notes are transcribed.
  • Payments are added up against the sale, so she sees what is paid and what is still owed.
  • The customer is placed on a sales pipeline, with the single next action to take.
  • She can ask plain questions like "who still owes me?" and get a grounded answer.
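To make the "typed record" and "paid versus owed" ideas concrete, here is a minimal sketch, not the app's actual code. The names `DocType`, `Record`, `parse_brl`, and `balance` are hypothetical, and I am assuming that only sale invoices and payment receipts carry an amount.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Optional


class DocType(Enum):
    AIR_QUOTE = "air_quote"
    FLIGHT_RESERVATION = "flight_reservation"
    BOARDING_PASS = "boarding_pass"
    PAYMENT_RECEIPT = "payment_receipt"
    SALE_INVOICE = "sale_invoice"
    HOTEL_BOOKING = "hotel_booking"
    IDENTITY_DOCUMENT = "identity_document"
    CHAT_SCREENSHOT = "chat_screenshot"
    UNKNOWN = "unknown"  # kept with its transcription, never dropped


@dataclass
class Record:
    doc_type: DocType
    transcription: str
    amount: Optional[Decimal] = None  # assumption: set only for invoices and receipts


def parse_brl(text: str) -> Decimal:
    """Parse a Brazilian amount such as 'R$ 1.234,56' into a Decimal."""
    digits = text.replace("R$", "").strip()
    # Brazilian format: '.' separates thousands, ',' separates decimals.
    return Decimal(digits.replace(".", "").replace(",", "."))


def total(records: list, doc_type: DocType) -> Decimal:
    """Sum the amounts of every record of one document type."""
    amounts = (r.amount for r in records
               if r.doc_type is doc_type and r.amount is not None)
    return sum(amounts, Decimal("0"))


def balance(records: list) -> dict:
    """Add up receipts against invoices for one customer card."""
    sold = total(records, DocType.SALE_INVOICE)
    paid = total(records, DocType.PAYMENT_RECEIPT)
    return {"sold": sold, "paid": paid, "owed": sold - paid}
```

Using `Decimal` rather than floats keeps money arithmetic exact, which matters when the difference between two receipts is R$ 50.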

Everything runs on small models, in process, on the machine. No external API. Client data never leaves. For a business built on private conversations, privacy is the whole point of the design, and everything else has to serve it.

This list looks obvious now, but it was not obvious to me at the start. My first version solved the wrong problem, because I had not interviewed her properly. I assumed I knew what mattered, mapped the data to my own guesses, and built records for things she did not care about while missing the ones she lived by. More code could not fix that. I had to go back and ask better questions in a second interview. Only then did the real shape of her business show up: quotes turning into reservations, receipts chasing a balance, documents blocking ticketing. Getting the domain wrong first, and being willing to throw it out and re-interview, was one of the most useful failures of the whole project.

The stack, kept small on purpose

  • MiniCPM-V 4.6, OpenBMB's 1.3B vision model, fine-tuned by me, reads and classifies each image.
  • MiniCPM5-1B, also from OpenBMB and also fine-tuned by me, extracts the typed fields and answers questions.
  • A small Whisper model, fine-tuned on Brazilian Portuguese, handles voice notes.
  • An NVIDIA Nemotron embedding model grounds the question answering, so the 1B model reads a small precise context instead of the whole chat.

Every model is well under the hackathon size cap. The biggest one in the app is 1.3B. That makes this a "Tiny Titan" build almost by accident: I was not chasing the category, because small was the point from the start.
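The retrieval step in the stack can be sketched in a few lines. This is an illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in I made up so the example runs without any model, whereas the real app uses a Nemotron embedding model whose API is not shown here. The ranking logic (cosine similarity, keep the top k chunks) is the part that carries over.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a real embedding model: lowercase bag of words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(question: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the question, best first."""
    query = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, toy_embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "recibo pix pago R$ 1.750,00",
    "voo LA3456 para Recife",
    "bom dia tudo bem",
]
context = top_k("qual o voo para Recife", chunks, k=1)
```

Only `context` would then be placed in the 1B model's prompt, instead of the whole conversation.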

Building with agents is not that easy

I built this mostly by directing coding agents, and I used different ones for different work. For most of the app I used Claude, plus a combination of DeepSeek and Grok. For the fine-tuning work I used Codex with GPT-5.5 at x-high reasoning, because that work is long, fiddly, and unforgiving of a careless step.

The thing I actually learned is that the limit was me. I work as an AI engineer, not a researcher, and here that gap mattered. The agents could go a long way, but only as far as I could steer them. On the app, where I understood the problem well, they flew. On the fine-tuning, where I did not, they would confidently do the wrong thing and I could not tell, because I did not yet know enough to catch it. The agents did not save me from learning. They raised the ceiling on what I could build, but the floor was still my own understanding. So I had to go and learn how fine-tuning actually works (the data, the failure modes, the evaluation) before the agents were any use on it at all.

For the model training work I leaned on a separate planning and execution loop, with every run written down in a log as it happened. That running log is the reason I can write this post honestly, because the failures are all there with dates and numbers.
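A run log like this does not need much machinery. The sketch below is one simple way to do it, assuming an append-only JSON Lines file. The function names and fields (`run_id`, `metrics`, `note`) are hypothetical and not taken from my actual log.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_run(log_path: str, run_id: str, metrics: dict, note: str = "") -> dict:
    """Append one training or evaluation run to a JSON Lines file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "metrics": metrics,
        "note": note,
    }
    # Append mode: earlier entries are never rewritten, failures included.
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def read_runs(log_path: str) -> list:
    """Load every logged run, oldest first."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Writing the entry at the moment the run ends, rather than reconstructing it later, is what keeps the failures in the record.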

The honest part: fine-tuning, which humbled me

I had not really fine-tuned models before this. The two models went very differently.

Vision: a real win, after some wrong turns

My first instinct was to force the vision model to emit structured JSON directly from each image. It failed, and at first I blamed the model. I was wrong to. The images are wildly inconsistent: phone photos of screens, glare, crops, different receipt layouts. Asking one small model to both read a noisy photo and produce perfect typed fields in one shot was simply too much to demand at once.

So I split the job. One stage transcribes and classifies the image. A second stage takes that clean text and produces the typed fields, with the model behind a clean interface so I never parse anything with brittle regex. I also made the tables editable, so when the model is wrong, she just fixes the cell. That small decision matters more than it looks: the product can be wrong sometimes, as long as a human can correct it in seconds.
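The split can be sketched as two small interfaces, with the models hidden behind them. This is a simplified illustration, not the app's code: `Reading`, `ImageReader`, `FieldExtractor`, and `process_image` are hypothetical names, and I am assuming the second stage returns a JSON string.

```python
import json
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Reading:
    doc_type: str        # stage 1: what the image is
    transcription: str   # stage 1: the text it contains


class ImageReader(Protocol):
    def read(self, image_bytes: bytes) -> Reading: ...


class FieldExtractor(Protocol):
    def extract(self, reading: Reading) -> str: ...  # assumed to return JSON text


def process_image(image_bytes: bytes, reader: ImageReader,
                  extractor: FieldExtractor) -> dict:
    """Run both stages; never lose the transcription if stage 2 fails."""
    reading = reader.read(image_bytes)
    try:
        fields = json.loads(extractor.extract(reading))
        if not isinstance(fields, dict):
            raise ValueError("expected a JSON object")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        fields = {}     # empty cells that a human can fill in by hand
    return {
        "doc_type": reading.doc_type,
        "transcription": reading.transcription,
        "fields": fields,
    }
```

Because callers only see `process_image`, either model can be swapped or retrained without touching the rest of the app, and a failed extraction degrades to an editable empty row instead of a crash.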

The real fight was accents. Brazilian Portuguese lives on its accents (não, São, conexão) and its number formats (R$ 1.234,56), and a de-accented output is wrong and looks careless. I found three separate copies of a glossary in my own code that had quietly dropped the accents, which means I was teaching the model to drop them too. I also learned the hard way that some image fonts silently throw away accent marks when rendering, so my synthetic training data was full of accent-less text that no amount of training could fix. Fixing the data at the source was most of the battle.
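A check like the one below would have caught the de-accented glossary copies early. It is a sketch using only Python's standard `unicodedata` module. The function names are hypothetical, and it assumes you have one trusted reference list of correctly accented terms to compare against.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks: 'conexão' -> 'conexao', 'ç' -> 'c'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_deaccented(candidates: list, reference: list) -> list:
    """Return (found, expected) pairs where a candidate term is a
    reference term with its accents stripped."""
    by_plain = {
        strip_accents(term).lower(): term
        for term in reference
        if strip_accents(term) != term  # only terms that carry accents
    }
    hits = []
    for term in candidates:
        normalized = unicodedata.normalize("NFC", term)
        expected = by_plain.get(normalized.lower())
        if expected is not None:
            hits.append((term, expected))
    return hits


reference = ["conexão", "São", "não", "voo"]
suspect_glossary = ["conexao", "São", "nao", "voo"]
problems = find_deaccented(suspect_glossary, reference)
```

This only flags exact de-accented matches, so a word that is legitimately spelled without accents and happens to collide with an accented one would need a manual look.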

I then built a distillation pipeline: a larger Portuguese vision model labeled real images, and the small MiniCPM-V student learned from those labels plus the corrected synthetic data. There were plenty of small disasters along the way: a reasoning teacher that would not stop thinking and never emitted clean JSON until I forced it, PDFs that could not be opened as images, and a memorable moment when I was convinced a training run had died at 6 percent when in fact it had finished perfectly and I had been looking at the logs of a different, crashed run.
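For a teacher that wraps its label in reasoning text, one defensive option when collecting labels is to pull out the last decodable JSON object from the raw output. This is not a description of how I forced the teacher, just a sketch of a fallback. `last_json_object` is a hypothetical helper built on the standard library's `json.JSONDecoder.raw_decode`, and the sample text is invented.

```python
import json
from typing import Optional


def last_json_object(text: str) -> Optional[dict]:
    """Return the last JSON object that decodes cleanly, or None."""
    decoder = json.JSONDecoder()
    found = None
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
            found = obj                       # keep the latest valid object
            pos = text.find("{", end)         # continue after it
        except json.JSONDecodeError:
            pos = text.find("{", pos + 1)     # not JSON here, try the next brace
    return found


raw = (
    "Thinking... maybe {type: receipt}? "
    'Final answer: {"doc_type": "payment_receipt", "amount": "R$ 1.750,00"} done'
)
label = last_json_object(raw)
```

Taking the last object rather than the first assumes the final answer comes after any drafts the teacher writes while reasoning. Outputs with no valid object return `None` and can be dropped from the training set.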

Here is the part that genuinely amazed me. A simple fine-tune, on mostly synthetic data, on an OpenBMB vision model with barely more than a billion parameters, just worked. This is a tiny model. I rendered my own training documents, fixed the accents at the source, and that alone moved the needle before I added any real data at all. On a held-out evaluation, comparing the base model to my fine-tune:

  • Accent preservation went from 0.71 to 0.88.
  • Character error rate dropped from 0.50 to 0.18.
  • Document type accuracy went from 0.01 to 0.46.
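For readers who want to compute similar numbers, here is a sketch of two of these metrics. Character error rate is the standard edit distance divided by reference length. The accent preservation function is one plausible definition that I am assuming for illustration (the share of accented characters in the reference that also appear in the output) and may differ from the exact formula behind the figures above.

```python
import unicodedata
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed per reference character."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def accented_chars(text: str) -> Counter:
    """Count characters that decompose into a base letter plus marks."""
    text = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in text if unicodedata.normalize("NFD", ch) != ch)


def accent_preservation(reference: str, hypothesis: str) -> float:
    """Assumed definition: share of reference accents found in the output."""
    ref = accented_chars(reference)
    total = sum(ref.values())
    if total == 0:
        return 1.0
    hyp = accented_chars(hypothesis)
    kept = sum(min(count, hyp[ch]) for ch, count in ref.items())
    return kept / total
```

Both functions normalize to NFC first, so a composed "ã" and a decomposed "a" plus tilde are treated as the same character.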

The fine-tune beat the base model on every metric I measured. The evaluation set overlaps the training data and the labels came from a model rather than a human, so I trust the direction more than the exact decimals, but the direction is not subtle and it shows in the app. A model this small, taught with data I generated myself, reads Lea's receipts and tickets well. That still feels like a small miracle to me.

Text: a long string of rejections

The text model was humbling in a way the vision model was not.

The job sounds easy: answer questions about one customer's dashboard. "How much does she still owe?" "What is the flight number?" In practice a 1B model on this task has two failure modes that are exactly the ones you cannot ship. It leaks, meaning it answers about the wrong customer or invents a value that is not there. And it over-answers, meaning when the honest reply is "that is not in the conversation," it makes something up anyway.

I trained candidate after candidate: 001, 002, then a v1, v2, v3, and v4 line. Most of them I rejected with my own evaluation gate. Some beat the base model on average score and still failed, because a higher average is not good enough if the model confidently tells you another client paid R$ 1.800 when they actually paid R$ 1.750. I learned to stop trusting a single number that went up. I expanded my test set from 32 cases to 158, added categories for leakage and refusal specifically, and only then could I see clearly that my "better" models were often trading one kind of error for another.
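The idea of a gate that ignores the average and looks at each category can be sketched like this. It is a simplified illustration, not my actual gate: the function names, the category labels, and the floor values are hypothetical, and the data below is a toy example invented for the sketch, not real evaluation results.

```python
from collections import defaultdict


def pass_rates(results: list) -> dict:
    """results is a list of (category, passed) pairs -> {category: pass rate}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def gate(candidate: list, baseline: list, floors: dict) -> tuple:
    """Reject a candidate that regresses in ANY category or misses a hard floor,
    even if its overall average is higher than the baseline's."""
    cand, base = pass_rates(candidate), pass_rates(baseline)
    reasons = []
    for category, base_rate in base.items():
        rate = cand.get(category, 0.0)
        if rate < base_rate:
            reasons.append(f"{category}: regressed from {base_rate:.2f} to {rate:.2f}")
        if rate < floors.get(category, 0.0):
            reasons.append(f"{category}: below floor {floors[category]:.2f}")
    return (not reasons, reasons)


# Toy data: the candidate wins on average but starts leaking.
baseline = [("lookup", True), ("lookup", False),
            ("leakage", True), ("leakage", True),
            ("refusal", True), ("refusal", False)]
candidate = [("lookup", True), ("lookup", True),
             ("leakage", True), ("leakage", False),
             ("refusal", True), ("refusal", True)]
accepted, reasons = gate(candidate, baseline, floors={"leakage": 1.0})
```

In this toy case the candidate passes 5 of 6 cases against the baseline's 4 of 6, and the gate still rejects it because the leakage category got worse.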

I also had to know when to stop. At one point I tried to use a very large 397B Portuguese model as the teacher. It would not fit, it ran out of memory even across eight H100 GPUs, and the cost to push further was not worth it for one decision. I switched to a 4B teacher instead. Picking the smaller, cheaper, good-enough tool was the right call, and it is very much in the spirit of this hackathon.

Where it landed: the deployed text model is better than the base model at the dashboard questions and noticeably better at Portuguese. It still leaks and over-answers more than I want, and it does not fully pass my strictest gate. I think that is the right thing to say out loud, because the gate exists exactly so I do not fool myself. This is a hard task for a one billion parameter model, and it earned its place by being a real improvement that says "I do not know" more often than the model it replaced.

What works, and what does not

What works:

  • The whole pipeline runs offline, on small models, on a normal machine. Client data never leaves.
  • Vision extraction measurably improved over the base model, especially on accents and Brazilian formats.
  • Question answering is grounded by retrieval, so the small model reads a small, relevant context instead of the entire chat.
  • The interface is fully custom, built as a travel agency desk with boarding pass cards and a departures board, with no stock components.

What does not, yet:

  • The text Q&A model still leaks across customers and over-answers on no-information questions more than I would like.
  • My evaluation sets are small and some labels come from models, so I trust trends more than exact scores.
  • It runs on modest hardware, so it is built for one conversation at a time, not a busy multi-user service.

Why this is the right kind of small

I built one tool for one person. It takes Lea's messy WhatsApp history and hands her back the business that was buried in it, on her own machine, using models small enough to run there.

And it worked for her. Lea got real, valuable data out of Viva Mais Turismo's own conversations: who paid, who still owes, where each trip stands, what to do next. That data came out of incredibly small open source models, running locally, and fully offline if she wants it that way. Her client conversations never have to leave her computer to become useful. She was satisfied with what it gave her, which is the only review of this project that actually counts.

That is what "build small" means to me. I mean a real solution scoped down to the size of one person's actual life, running on models small enough to respect their privacy, rather than a shrunken copy of some big idea. It helped Lea, so it was worth building, and that is the whole point.

Social Media Post/Demo HF Space
