Anim Vid AI: Turning Any Concept Into a Manim Animation With Fine-Tuned Small Models

Community Article
Published June 15, 2026

"Anim Vid AI — Describe any concept. Get a beautiful Manim animation. No Manim experience needed."

Try it here: Anim Vid AI on Hugging Face


Everyone who has ever tried to explain a math formula or algorithm knows the feeling — words only go so far. Manim is the gold standard for visualizing these concepts, but writing Manim code by hand is a skill of its own. Anim Vid AI removes that barrier entirely. Describe what you want. Get a rendered animation back.


How It Works

The app uses a fine-tuned Qwen3-14B model to generate complete, working Manim Python code from a plain English description. A fine-tuned Qwen3-0.6B draft model runs alongside it as a speculative decoder: it proposes several tokens at a time, and the 14B model verifies them together, accepting the drafts that match its own predictions and doing full computation only from the first mismatch onward. The result is noticeably faster generation with no loss in output quality, because the 14B model has the final say on every token.
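The draft-and-verify loop can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the app's inference code: the two "models" are hypothetical stand-in functions that return the next token for a sequence, decoding is greedy, and verification is written as a sequential loop where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to its next token.
NextToken = Callable[[List[str]], str]

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[str], max_new: int, k: int = 4) -> List[str]:
    """Greedy draft-and-verify loop; returns only the newly generated tokens."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The draft model proposes up to k tokens.
        proposal: List[str] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target model checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # 3. First mismatch: keep the accepted prefix plus the
                #    target's own token, and discard the rest of the draft.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out[len(prompt):]

# Hypothetical stand-ins for the 14B and 0.6B models.
SCRIPT = "class MyScene ( Scene ) : def construct ( self ) : self . wait ( )".split()

def target(seq: List[str]) -> str:
    return SCRIPT[len(seq) % len(SCRIPT)]

def draft(seq: List[str]) -> str:
    # Deliberately wrong at every fifth position to force some rejections.
    return "?" if len(seq) % 5 == 4 else target(seq)

print(" ".join(speculative_decode(target, draft, [], max_new=12)))
```

Because a rejected draft token is always replaced by the target's own choice, the output is identical to what the target would produce decoding alone. Only the number of target calls changes.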

The architecture is split across two Gradio Spaces deliberately:

  • This Space handles all AI inference — model loading, LoRA adapters, speculative decoding, code streaming
  • A companion Space (anim-vid-api-mcp) handles all Manim rendering — keeping the Manim environment and its dependencies fully isolated

This keeps each concern independent and makes the renderer reusable as a standalone API by other apps.
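The handoff between the two Spaces might look roughly like the sketch below. Everything named here is an assumption for illustration: `extract_code`, `generate_and_render`, and the two callables are hypothetical, and the renderer's real endpoint is not described in this article. The point is only that the inference side needs a string of code and a function that sends it elsewhere.

```python
import re
from typing import Callable, Optional

FENCE = "`" * 3  # three backticks, built this way to keep this example readable
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)

def extract_code(model_output: str) -> Optional[str]:
    """Return the body of the first fenced code block, or None if there is none."""
    match = CODE_BLOCK.search(model_output)
    return match.group(1).strip() if match else None

def generate_and_render(prompt: str,
                        generate: Callable[[str], str],
                        render: Callable[[str], str]) -> str:
    """Run inference locally, then hand the extracted code to a separate renderer."""
    code = extract_code(generate(prompt))
    if code is None:
        raise ValueError("model output contained no code block")
    # The renderer is injected, so the Manim environment never has to be
    # installed alongside the model.
    return render(code)
```

In a deployment, `render` could wrap a call made with `gradio_client` to the companion Space. The Space's full ID and endpoint name are not given here, so they are left out rather than guessed.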


The Hard Part: Dataset Curation

This was the real challenge of the project — and the part that took the most engineering.

Public Manim code exists, but not at the scale or consistency a fine-tune actually needs. What's out there rarely targets a specific rendering environment, mixes syntax versions freely, and has no guarantee of running without errors.

To solve this, roughly 10,000 Manim Python scripts were synthetically generated using Claude Sonnet 4.5, seeded from topics collected during a prior Manim animation project. This gave full control over quality and consistency. Every script in the dataset:

  • Follows ManimCE v0.19 syntax throughout
  • Names the scene class MyScene
  • Ends with self.wait()
  • Covers a broad range of subjects — geometry, calculus, physics, algebra, algorithms, and more
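The structural conventions above are easy to check mechanically. The following is a minimal sketch of such a check, not the pipeline actually used for this dataset: `check_script` and the sample script are hypothetical, and "ends with self.wait()" is interpreted here as "the last statement of `construct` is a call to `self.wait`". It uses only the standard-library `ast` module, so it cannot confirm ManimCE v0.19 compatibility. Only an actual render can do that.

```python
import ast
from typing import List

def check_script(source: str) -> List[str]:
    """Return a list of convention violations; an empty list means the script passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    scenes = [n for n in tree.body
              if isinstance(n, ast.ClassDef) and n.name == "MyScene"]
    if not scenes:
        return ["no class named MyScene"]
    construct = next((n for n in scenes[0].body
                      if isinstance(n, ast.FunctionDef) and n.name == "construct"), None)
    if construct is None:
        return ["MyScene has no construct() method"]
    last = construct.body[-1]
    ends_with_wait = (
        isinstance(last, ast.Expr)
        and isinstance(last.value, ast.Call)
        and isinstance(last.value.func, ast.Attribute)
        and last.value.func.attr == "wait"
        and isinstance(last.value.func.value, ast.Name)
        and last.value.func.value.id == "self"
    )
    return [] if ends_with_wait else ["construct() does not end with self.wait()"]

# Illustrative script in the dataset's style (not taken from the dataset).
SAMPLE = '''
from manim import *

class MyScene(Scene):
    def construct(self):
        axes = Axes(x_range=[-3, 3], y_range=[-1, 9], axis_config={"color": BLUE})
        curve = axes.plot(lambda x: x ** 2, color=YELLOW)
        self.play(Create(axes), Create(curve))
        self.wait()
'''

print(check_script(SAMPLE))  # prints []
```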

Both LoRA adapters — the 14B and the 0.6B — were trained on this identical dataset, which is what makes speculative decoding work well here. When both models are trained on the same distribution, the draft model's token proposals closely match what the main model would generate, leading to a high acceptance rate and real speed gains.
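To see why the acceptance rate matters so much, a standard back-of-the-envelope model helps. The sketch below assumes each drafted token is accepted independently with probability `alpha` and that the draft proposes `k` tokens per round. Under that simplification the expected number of tokens produced per target-model pass is (1 - alpha^(k+1)) / (1 - alpha). The acceptance rates in the loop are illustrative values, not measurements from this project.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass with k drafted tokens.

    Assumes every drafted token is accepted independently with probability
    alpha, and that the target contributes one token of its own each round.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the target's token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Illustrative acceptance rates only.
for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_step(alpha, k=4), 2))
```

The curve is steep near the top: moving the acceptance rate from 0.8 to 0.95 adds more than a full token per pass at `k=4`, which is why matching the draft model's distribution to the main model's pays off.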


What the LoRA Actually Fixed

Before fine-tuning, even with an explicit system prompt, base Qwen3 models produced unreliable Manim output. The LoRA consistently resolved all of these:

| Behaviour | Base Model (no LoRA) | Fine-Tuned with LoRA |
| --- | --- | --- |
| Output format | Sometimes returned plain text instead of code | Always returns a valid Python code block |
| Code structure | Ignored MyScene naming and self.wait() even when told explicitly | Consistently follows the required structure |
| Library compatibility | Used standard Manim APIs that conflict with the rendering Space | Generates code compatible with the rendering environment |
| Syntax version | Mixed deprecated methods and older ManimCE syntax | Stays within ManimCE v0.19 throughout |
| Axis configuration | Used axis_color= directly, throwing errors | Correctly uses axis_config={'color': ...} |

The base model knew Manim. The fine-tune knew this Manim — for this renderer, this environment, this structure.


Training Setup

Both models were fine-tuned on Modal using an A100 GPU, completing in approximately 3 hours total.

| Parameter | Value |
| --- | --- |
| Epochs | 3 |
| Learning Rate | 6e-5 |
| LR Scheduler | linear |
| Warmup Ratio | 0.03 |
| Optimizer | adamw_torch_fused |
| Per Device Train Batch Size | 4 |
| Gradient Accumulation Steps | 32 |
| Effective Batch Size | 128 (4 × 32) |
| Validation Split | 10% |
| Eval Steps | every 50 steps |
| Precision | bf16 |
| Seed | 42 |
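The arithmetic behind these settings can be worked through in a few lines. This is a sketch under assumptions: the dictionary keys follow Hugging Face `TrainingArguments` field names, which the optimizer name suggests but the article does not confirm. The dataset size is taken as exactly 10,000, training is assumed to run on a single GPU, and step counts are approximate because trainers differ in how they round a partial final batch.

```python
import math

# Values from the table; key names are an assumption (TrainingArguments-style).
config = {
    "num_train_epochs": 3,
    "learning_rate": 6e-5,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.03,
    "optim": "adamw_torch_fused",
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 32,
    "eval_steps": 50,
    "bf16": True,
    "seed": 42,
}

NUM_SCRIPTS = 10_000  # "roughly 10,000" in the article; exact count not stated
NUM_GPUS = 1          # assumption: a single A100
VAL_PERCENT = 10

train_examples = NUM_SCRIPTS * (100 - VAL_PERCENT) // 100
effective_batch = (config["per_device_train_batch_size"]
                   * config["gradient_accumulation_steps"] * NUM_GPUS)
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * config["num_train_epochs"]
warmup_steps = math.ceil(config["warmup_ratio"] * total_steps)
num_evals = total_steps // config["eval_steps"]

print(f"train examples:  {train_examples}")
print(f"effective batch: {effective_batch}")
print(f"optimizer steps: {total_steps} ({steps_per_epoch} per epoch)")
print(f"warmup steps:    {warmup_steps}")
print(f"evaluations:     {num_evals}")
```

With a dataset this size, the whole run is only a couple of hundred optimizer steps, so evaluating every 50 steps yields just a handful of validation checkpoints per model.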


What I Learned

The synthetic dataset approach was the right call, and the lesson was that dataset quality beats dataset size. A clean, consistent 10k dataset trained both adapters reliably. Messy public data at 10x the scale would likely have produced a model that only half-works.

The two-Space architecture was also the right call. Keeping Manim's rendering environment isolated from the inference Space meant each side could be debugged, updated, and reused independently. The renderer is already usable as a standalone API — other apps can call it directly.

When you fine-tune both the main model and the draft model on the same data, speculative decoding stops being a gamble and starts being a genuine speed layer.

