Agenda Parser: Field Notes
Outline of the Problem
Local governments in the United States (municipalities and counties) publish agendas for upcoming meetings along with supporting materials. These are usually called agenda packets. They often run to hundreds of pages and are not the most interesting of reads. The important information in these packets (how much things cost, comparisons of alternatives, completed studies, etc.) is often buried. How can a local government official or a member of the public stay prepared and informed with this much material to sift through?
My idea was to build infrastructure to parse the agenda packet and connect its pages to their voting items. From there, the simplest thing to build is a RAG over these packet pages. Even better, most smaller models have instruction-tuned variants, so one can think about building agents for more complicated requests.
Do Small Models Fit the Problem?
The obvious concern with using smaller models on large documents is context management. Will we be able to fit a system prompt, tool calls, and repeated calls to the agenda in a 128K context window?
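One way to reason about that question is a back-of-the-envelope budget check before each model call. The sketch below is illustrative only: the function names are hypothetical, and the 4-characters-per-token ratio is an assumed rule of thumb rather than a measurement of any particular tokenizer.

```python
# Rough context-budget check (illustrative sketch, not the app's code).
CONTEXT_WINDOW = 128_000  # tokens, the window size mentioned above
CHARS_PER_TOKEN = 4       # ASSUMPTION: crude heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate a token count from string length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(system_prompt: str, tool_schemas: str,
                    retrieved_chunks: list[str],
                    reserve_for_output: int = 2_000) -> bool:
    """Return True if the prompt parts plus an output reserve fit the window."""
    used = estimate_tokens(system_prompt) + estimate_tokens(tool_schemas)
    used += sum(estimate_tokens(chunk) for chunk in retrieved_chunks)
    return used + reserve_for_output <= CONTEXT_WINDOW
```

A real implementation would count tokens with the model's own tokenizer; the point here is only that the budget is a sum of a few parts that can be checked cheaply.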
One technique we can pull from the early 2020s is to use a vector database and either run a semantic search to pull the most relevant chunks or map-reduce over all the chunks until we get something that fits our smaller context window. If we wish to move into agentic flows, my observation and experience is that the models don't have to be very powerful, given the right constraints on the data and tools being used. Therefore, the general strategy is as follows: a RAG constrained to the text of a specific agenda item, and a small agent constrained to a number of tools for interacting with the full text of an agenda packet.
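To make the "RAG constrained to a specific agenda item" idea concrete, here is a minimal sketch of item-scoped retrieval. It is not the app's implementation: the chunk layout (`item_id`, `text`) is hypothetical, and a toy bag-of-words vector stands in for a real embedding model and vector database.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. Stand-in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query: str, chunks: list[dict], item_id: str, k: int = 3) -> list[dict]:
    """Rank only the chunks belonging to one agenda item and keep the best k.

    `chunks` is a list of dicts with hypothetical 'item_id' and 'text' keys.
    """
    q = embed(query)
    candidates = [c for c in chunks if c["item_id"] == item_id]
    ranked = sorted(candidates, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]
```

Filtering by agenda item before ranking is the constraint that keeps the prompt small: the model only ever sees the few best chunks from the item being asked about.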
One additional domain to consider for local governments: what if an agenda item requires consideration of federal or state laws and regulations? For example, in Michigan, local governments are required to hold meetings in accordance with the Open Meetings Act, which sets a number of requirements on how politicians can communicate on business, how meetings must be announced, a requirement for public comment, etc. It is very common for a local representative to have these types of low-stakes legal questions, and the public would be well served by making this kind of information more accessible.
Cornell LII has a great repository of laws and regulations with very permissive terms of use, which makes it a great candidate for building some domain-specific tools for these types of questions. Additionally, I'll make this more Michigan-centric by using the Michigan Compiled Laws section of our legislature's website. One can also help a smaller model by including the structure of these resources in the system prompt, which is another lever for improving performance.
We will have three resources (RAG, agenda agent, Cornell LII/state regulation agent), and the next step is to choose models that are small enough and that either are or can be instruction tuned.
Picking a Model
For this project I chose to use the recently released Gemma 4 series of models. My choices of parameter sizes:
I liked that Gemma 4 was well reviewed, had a number of models under 32B, and had models that were already instruction tuned. These models also come in several sizes, and I chose the 4B and 26B-A4B models as a good compromise between size and performance. As we shall see, this gives a good delineation between lite and higher-performance models, and it isn't much work to set up a toggle to switch between them.
After verifying I could run these in a ZeroGPU Space, I wrote a training job in Modal to get my own cpp-runnable versions of these models. Overall, I found these models to be pretty good for the use case once we set up some tools to query against agenda packets, and I stuck with them for the project. However, I'm confident that, with the right system prompts and tools, any recent model around the 32B size would get to "good enough for a hackathon POC."
Simple Programming Wins
Working with agenda packets offers a number of simple wins that don't require AI. One can improve the UX of these packets by:
- Parsing the agenda items into nicer-looking components
- Matching the action items with their underlying agenda pages
We can do both of these with very nice looking components by using gr.Server and building a standard "React with Tailwind" look for the frontend.
Additionally, because we're using Gradio, we can use localStorage to save data on our agenda packets (we can't keep the agenda packets themselves, which is why one has to re-upload them), and this is going to combine nicely with our agent runs.
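The second win above, matching action items to their packet pages, can be sketched without any model at all. The version below is a hypothetical illustration, not the app's parser: it assumes each item's supporting material starts with a page carrying a marker such as "Agenda Item 7.a" and that all following pages belong to that item until the next marker appears. Real packets vary, so the pattern would need adjusting per jurisdiction.

```python
import re

# ASSUMPTION: supporting material for each item begins with a page that
# contains a marker like "Agenda Item 7.a". This pattern is hypothetical.
ITEM_MARKER = re.compile(r"agenda item\s+(\d+(?:\.[a-z0-9]+)?)", re.IGNORECASE)


def map_pages_to_items(pages: list[str]) -> dict[str, list[int]]:
    """Assign 1-indexed page numbers to the most recent item marker seen.

    Pages before the first marker (cover sheet, table of contents) are skipped.
    """
    mapping: dict[str, list[int]] = {}
    current = None
    for page_number, text in enumerate(pages, start=1):
        match = ITEM_MARKER.search(text)
        if match:
            current = match.group(1).lower()
        if current is not None:
            mapping.setdefault(current, []).append(page_number)
    return mapping
```

The resulting dictionary is small and JSON-serializable, which is the kind of derived data that can be kept in the browser even when the packet itself cannot.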
Fine-tuning, GRPO, and .cpp Compatibility
The stock Gemma 4 models are already decent at JSON tool calling, but I wanted to see how far a hackathon's worth of fine-tuning could push small models (given that we received $250 in Modal credits). Hugging Face has great libraries for distillation, and building these workflows is much simpler when newer large coding agents help put them together for you.
My approach was teacher distillation. I used two strong open-weight teacher models (Kimi k2.6 and DeepSeek 4 pro) to generate a number of example runs through our agenda packet and Cornell LII agent tools. I did the same with our two chosen gemma-4-it models. From there, I used an LLM-as-judge (Minimax-2.7) to improve the quality of these traces.
One can find the dataset of the traces for this setup here.
The next step is a LoRA fine-tuning pipeline with GGUF quantization (to make the models .cpp compatible), and again that's pretty easy to automate between a coding agent and Modal.
The three model outputs:
- agenda-parser-lite (SFT and 8K quantization)
- agenda-parser-medium (SFT and 4K quantization)
- agenda-parser-high (SFT, GRPO and 8K quantization)
As you can see I built a third model with an additional GRPO training step. Here's the full scoring rubric for that GRPO process:
| Signal | What it measures | Weight |
|---|---|---|
| format | fraction of steps that are a single valid JSON action | +1.0 |
| completion | reached a clean final_answer (not an error or step-limit ramble) | +1.0 |
| faithfulness | judge: final answer grounded in the text the agent retrieved | +1.5 |
| overall | judge: overall answer quality | +0.75 |
| tool_ok | 1 − tool-error rate | +0.5 |
| efficiency | finished without burning every step | +0.25 |
| error (penalty) | the turn crashed / a model call failed | −1.0 |
| invalid_tool (penalty) | rate of steps naming a non-existent tool | −0.5 |
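As a rough illustration of how a rubric like this turns into a single scalar reward, the sketch below combines the signals as a weighted sum. The weights come from the rubric above; everything else is an assumption for illustration, in particular that each signal is already scaled to the range 0 to 1 and that the `format` signal is computed by checking whether each step parses as one JSON object. The function names are hypothetical.

```python
import json

# Weights copied from the rubric above; penalties carry negative weights.
WEIGHTS = {
    "format": 1.0,
    "completion": 1.0,
    "faithfulness": 1.5,
    "overall": 0.75,
    "tool_ok": 0.5,
    "efficiency": 0.25,
    "error": -1.0,
    "invalid_tool": -0.5,
}


def format_fraction(steps: list[str]) -> float:
    """Fraction of raw step outputs that parse as a single JSON object."""
    if not steps:
        return 0.0
    valid = 0
    for raw in steps:
        try:
            if isinstance(json.loads(raw), dict):
                valid += 1
        except json.JSONDecodeError:
            pass  # not valid JSON, or JSON followed by extra text
    return valid / len(steps)


def total_reward(signals: dict[str, float]) -> float:
    """Weighted sum of signals; missing signals count as 0.

    ASSUMPTION: every signal has already been scaled to [0, 1].
    """
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
```

Under that scaling assumption, a run that maxes out every positive signal with no penalties would score 5.0, and a crashed turn with nothing else to show would score −1.0.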
Eval scoreboard (base vs fine-tuned)
To check whether any of this actually helped, I built an A/B eval. For each tier I run the stock base GGUF and my fine-tune through the same held-out tasks. I held out a sample packet for this evaluation (oakland-1570) and added LII law questions related to local government. My idea was to use another LLM-as-judge setup (Minimax-2.7 again) to pit the fine-tuned and stock models of the same size against each other, in the hope that the judge would prefer the fine-tuned models head-to-head:
| model | ft wins | base wins | ties | ft win-rate |
|---|---|---|---|---|
| lite | 2 | 2 | 6 | 0.20 |
| medium | 3 | 0 | 7 | 0.30 |
| high (SFT→GRPO) | 2 | 2 | 6 | 0.20 |
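The win-rate column is consistent with dividing fine-tune wins by all judged comparisons, ties included. A short sketch that reproduces the column from the counts in the table (the function name is hypothetical):

```python
def ft_win_rate(ft_wins: int, base_wins: int, ties: int) -> float:
    """Fine-tune wins divided by all comparisons, with ties in the denominator."""
    total = ft_wins + base_wins + ties
    return ft_wins / total if total else 0.0


# (ft wins, base wins, ties) per tier, copied from the scoreboard above.
scoreboard = {"lite": (2, 2, 6), "medium": (3, 0, 7), "high": (2, 2, 6)}
for tier, counts in scoreboard.items():
    print(f"{tier}: {ft_win_rate(*counts):.2f}")
# prints:
# lite: 0.20
# medium: 0.30
# high: 0.20
```

Because ties sit in the denominator, a 0.20 win-rate next to two base wins is an even split among the decided comparisons rather than a loss for the fine-tune.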
Overall, the fine-tuned and base models performed about the same in this evaluation, with medium being the exception. We also had a nice scoring setup from the GRPO step, so we can report that reward along with more LLM-as-judge metrics on success, whether it passed the action, whether the actions were valid, and faithfulness to the packet:
| metric (base → fine-tuned) | lite | medium | high |
|---|---|---|---|
| mean_reward | 3.29 → 3.21 | 3.42 → 3.31 | 3.44 → 3.26 |
| success_rate | 0.50 → 0.53 | 0.63 → 0.53 | 0.57 → 0.47 |
| pass@k | 0.50 → 0.60 | 0.60 → 0.70 | 0.70 → 0.50 |
| valid_action | 0.94 → 0.93 | 0.96 → 0.92 | 0.94 → 0.93 |
| faithfulness | 3.33 → 3.23 | 3.17 → 3.32 | 3.17 → 3.53 |
Generally, the results are mixed, which makes sense because Gemma 4 models are general-purpose and already saturated on these types of problems. Perhaps if I expanded my eval problem sets, worked harder on the training data set, or had more time I could get better results, but I think this is a good enough demonstration for a hackathon. In the end, these SFT models scored about as well on everything with some slight upsides, so we'll use these in the app.
How do these models do within the Gradio Space? Here's an example of our agenda item RAG (lite):
And here's an agent run on the packet (also lite):
I also published a number of examples of these fine-tuned model traces with these types of tasks here.
What I've Learned and the Future of Small Models
By far the biggest difference with models from the early 2020s is the ability to use large coding agents. As the flagship models become more powerful, the loop of training, testing, and applying these distilled smaller models becomes significantly faster. My general takeaway is to think bigger as to what a demo/hackathon project could be, because the tooling is good enough and getting better.
One can also abstract the problem behind this app: find any source/API/data you'd like to do inference over, build some one-shot workflows with the LLM (in our case, a RAG), build or reuse a harness, and build agent workflows with a set of domain-constrained tools for the common problems you face. Voilà, there's an AI app that can reasonably fit on a consumer GPU. I've seen a lot of success with larger models in this type of setup, and I'm excited to see smaller models become a real alternative for local workflows.
I'm looking forward to a near-future where we get "good enough" coding agents that can run on 16GB GPUs and the applications will be immediate across a number of domains.
Links
- Try the app live: build-small-hackathon/agenda-parser
- Everything in one place (the Space, the three fine-tuned models, and both trace datasets): the Agenda Parser collection



