Anim Vid AI: Turning Any Concept Into a Manim Animation With Fine-Tuned Small Models
"Anim Vid AI — Describe any concept. Get a beautiful Manim animation. No Manim experience needed."
Try it here: Anim Vid AI on Hugging Face
Everyone who has ever tried to explain a math formula or algorithm knows the feeling — words only go so far. Manim is the gold standard for visualizing these concepts, but writing Manim code by hand is a skill of its own. Anim Vid AI removes that barrier entirely. Describe what you want. Get a rendered animation back.
How It Works
The app uses a fine-tuned Qwen3-14B model to generate complete, working Manim Python code from a plain English description. A fine-tuned Qwen3-0.6B draft model runs alongside it for speculative decoding: the draft proposes several tokens at a time, and the 14B model verifies them together, accepting the tokens it agrees with and falling back to its own prediction at the first position where the draft was wrong. Because the 14B model has the final say on every token, generation is noticeably faster with no loss in output quality.
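The draft-then-verify loop can be sketched with toy stand-ins. This is a minimal greedy version, assuming deterministic next-token functions in place of the real Qwen3 models. The names `speculative_decode`, `target_next`, and `draft_next` are hypothetical, and a real implementation verifies all draft positions in one batched forward pass instead of one call per position.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def greedy_decode(next_token: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding with a single model, used as the reference."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        draft: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks each proposed position in order.
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if proposed != expected:
                # First mismatch: keep the accepted prefix plus the target's
                # own token, and discard the rest of the draft.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            # Every draft token matched. The verification pass also yields
            # the target's next token, so take it if there is room.
            tokens.extend(draft)
            if len(tokens) < goal:
                tokens.append(target_next(tokens))
    return tokens


# Toy "large" model: counts upward modulo 10.
def target_next(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


# Toy "small" model: same rule, but deliberately wrong right after a 7.
def draft_next(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10
```

Under greedy decoding the output is identical to what the target model would produce alone, whatever the draft proposes; a weaker draft only lowers the number of tokens accepted per pass.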
The architecture is split across two Gradio Spaces deliberately:
- This Space handles all AI inference — model loading, LoRA adapters, speculative decoding, code streaming
- A companion Space (anim-vid-api-mcp) handles all Manim rendering — keeping the Manim environment and its dependencies fully isolated
This keeps each concern independent and makes the renderer reusable as a standalone API by other apps.
The Hard Part: Dataset Curation
This was the real challenge of the project — and the part that took the most engineering.
Public Manim code exists, but not at the scale or consistency a fine-tune actually needs. What's out there rarely targets a specific rendering environment, mixes syntax versions freely, and has no guarantee of running without errors.
To solve this, roughly 10,000 Manim Python scripts were synthetically generated using Claude Sonnet 4.5, seeded from topics collected during a prior Manim animation project. This gave full control over quality and consistency. Every script in the dataset:
- Follows ManimCE v0.19 syntax throughout
- Names the scene class MyScene
- Ends with self.wait()
- Covers a broad range of subjects: geometry, calculus, physics, algebra, algorithms, and more
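The structural conventions in this list are easy to check mechanically. The sketch below is one way such a check could look, using only Python's standard `ast` module; it is an assumption for illustration, not the validation the project actually ran. `check_script` and `SAMPLE` are hypothetical names, and the sample script is only parsed, never executed, so Manim does not need to be installed.

```python
import ast
from typing import List


def check_script(source: str) -> List[str]:
    """Return a list of convention violations (an empty list means it passes)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    # Convention 1: a top-level class named MyScene.
    scene = next(
        (n for n in tree.body if isinstance(n, ast.ClassDef) and n.name == "MyScene"),
        None,
    )
    if scene is None:
        return ["no top-level class named MyScene"]

    construct = next(
        (n for n in scene.body if isinstance(n, ast.FunctionDef) and n.name == "construct"),
        None,
    )
    if construct is None:
        return ["MyScene has no construct() method"]

    # Convention 2: the last statement of construct() is a self.wait(...) call.
    # Any arguments are allowed here; that leniency is an assumption.
    last = construct.body[-1]
    ends_with_wait = (
        isinstance(last, ast.Expr)
        and isinstance(last.value, ast.Call)
        and ast.unparse(last.value.func) == "self.wait"
    )
    if not ends_with_wait:
        return ["construct() does not end with self.wait()"]
    return []


# An illustrative script that follows the conventions (parsed only, not run).
SAMPLE = '''
from manim import *

class MyScene(Scene):
    def construct(self):
        axes = Axes(x_range=[-3, 3, 1], y_range=[-2, 2, 1], axis_config={"color": BLUE})
        self.play(Create(axes))
        self.wait()
'''

print(check_script(SAMPLE))
```

A check like this only covers structure. Whether a script stays within ManimCE v0.19 syntax, or renders without errors, still has to be confirmed by actually running it.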
Both LoRA adapters — the 14B and the 0.6B — were trained on this identical dataset, which is what makes speculative decoding work well here. When both models are trained on the same distribution, the draft model's token proposals closely match what the main model would generate, leading to a high acceptance rate and real speed gains.
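To see why the acceptance rate matters, here is a back-of-the-envelope estimate. It rests on a simplifying assumption: each draft token is accepted independently with probability `alpha`, and every verification pass contributes one token from the main model (the correction after a rejection, or a bonus token when the whole draft is accepted). Under that assumption, the expected number of tokens per main-model pass with `k` draft tokens is (1 - alpha^(k+1)) / (1 - alpha). The function name and the example values are illustrative, not measurements from this project.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per main-model verification pass.

    alpha: probability that a single draft token is accepted (assumed i.i.d.).
    k: number of tokens the draft model proposes per pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the bonus token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.8, 0.95):
        print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

With `k=4`, an acceptance rate of 0.5 yields about 1.9 tokens per pass, 0.8 yields about 3.4, and 0.95 yields about 4.5. The actual wall-clock gain is smaller, because the draft model's own forward passes are not free.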
What the LoRA Actually Fixed
Before fine-tuning, even with an explicit system prompt, base Qwen3 models produced unreliable Manim output. The LoRA consistently resolved all of these:
| Behaviour | Base Model (no LoRA) | Fine-Tuned with LoRA |
|---|---|---|
| Output format | Sometimes returned plain text instead of code | Always returns a valid Python code block |
| Code structure | Ignored MyScene naming and self.wait() even when told explicitly | Consistently follows the required structure |
| Library compatibility | Used standard Manim APIs that conflict with the rendering Space | Generates code compatible with the rendering environment |
| Syntax version | Mixed deprecated methods and older ManimCE syntax | Stays within ManimCE v0.19 throughout |
| Axis configuration | Used axis_color= directly, throwing errors | Correctly uses axis_config={'color': ...} |
The base model knew Manim. The fine-tune knew this Manim — for this renderer, this environment, this structure.
Training Setup
Both models were fine-tuned on Modal using an A100 GPU, completing in approximately 3 hours total.
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Learning Rate | 6e-5 |
| LR Scheduler | linear |
| Warmup Ratio | 0.03 |
| Optimizer | adamw_torch_fused |
| Per Device Train Batch Size | 4 |
| Gradient Accumulation Steps | 32 |
| Effective Batch Size | 128 (4 × 32) |
| Validation Split | 10% |
| Eval Steps | every 50 steps |
| Precision | bf16 |
| Seed | 42 |
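The table implies a fairly short optimization run, which a little arithmetic makes concrete. The sketch below assumes a dataset of exactly 10,000 scripts (the post says "roughly") and a single GPU. Most field names mirror Hugging Face `TrainingArguments`, which is an assumption, since the post does not name the training library; `TrainConfig` and `schedule` are hypothetical. Exact step counts also depend on how a given trainer handles the final partial batch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the table; field names are illustrative.
    num_train_epochs: int = 3
    learning_rate: float = 6e-5
    lr_scheduler_type: str = "linear"
    warmup_ratio: float = 0.03
    optim: str = "adamw_torch_fused"
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 32
    validation_split: float = 0.10
    eval_steps: int = 50
    bf16: bool = True
    seed: int = 42

    @property
    def effective_batch_size(self) -> int:
        # Single-GPU assumption: no data-parallel multiplier.
        return self.per_device_train_batch_size * self.gradient_accumulation_steps

    def schedule(self, dataset_size: int) -> dict:
        """Approximate step counts for a dataset of the given size."""
        train_examples = round(dataset_size * (1 - self.validation_split))
        steps_per_epoch = math.ceil(train_examples / self.effective_batch_size)
        total_steps = steps_per_epoch * self.num_train_epochs
        return {
            "train_examples": train_examples,
            "steps_per_epoch": steps_per_epoch,
            "total_steps": total_steps,
            "warmup_steps": math.ceil(total_steps * self.warmup_ratio),
            "evaluations": total_steps // self.eval_steps,
        }


if __name__ == "__main__":
    cfg = TrainConfig()
    print(cfg.effective_batch_size)
    print(cfg.schedule(10_000))
```

Under these assumptions, each adapter sees about 9,000 training scripts, takes roughly 71 optimizer steps per epoch and about 213 in total, warms up over the first 7 steps, and is evaluated 4 times.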
Tech Stack
- Main Model: unsloth/Qwen3-14B + ovi054/qwen3-14b-manim-lora
- Draft Model: Qwen/Qwen3-0.6B + ovi054/qwen3-0.6b-manim-lora
- Rendering Backend: anim-vid-api-mcp
- Training Infrastructure: Modal (A100 GPU)
- Frontend: Gradio
What I Learned
The synthetic dataset approach was the right call, and the main lesson was that dataset quality beats dataset size. A clean, consistent dataset of roughly 10,000 scripts trained both adapters reliably. Messy public data at 10x the scale would likely have produced a model that only half works.
The two-Space architecture was also the right call. Keeping Manim's rendering environment isolated from the inference Space meant each side could be debugged, updated, and reused independently. The renderer is already usable as a standalone API — other apps can call it directly.
When you fine-tune both the main model and the draft model on the same data, speculative decoding stops being a gamble and starts being a genuine speed layer.
Try it here: Anim Vid AI on Hugging Face