Anim Vid AI: Turning Any Concept Into a Manim Animation With Fine-Tuned Small Models

Published June 15, 2026

"Anim Vid AI — Describe any concept. Get a beautiful Manim animation. No Manim experience needed."

Try it here: Anim Vid AI on Hugging Face

Everyone who has ever tried to explain a math formula or algorithm knows the feeling — words only go so far. Manim is the gold standard for visualizing these concepts, but writing Manim code by hand is a skill of its own. Anim Vid AI removes that barrier entirely. Describe what you want. Get a rendered animation back.

How It Works

The app uses a fine-tuned Qwen3-14B model to generate complete, working Manim Python code from a plain English description. A fine-tuned Qwen3-0.6B draft model runs alongside it as a speculative decoder: it proposes several tokens at a time, and the 14B model verifies them together, accepting the drafts it agrees with and doing full computation only where the draft was wrong. Because the 14B model has the final say on every token, generation gets noticeably faster with no loss in output quality.
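To make the verify-and-accept loop concrete, here is a minimal sketch of greedy speculative decoding. It is not the app's inference code: `toy_target` and `toy_draft` are made-up stand-ins for the 14B and 0.6B models, tokens are plain integers, and the verification that a real system does in one batched forward pass is written as a simple loop.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a token sequence to the next token


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Draft proposes up to k tokens; target keeps the longest prefix it agrees with."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The cheap draft model proposes a short run of tokens.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if tok != expected:
                # First disagreement: keep the agreed prefix plus the target's own token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            # Every proposed token matched what the target would have produced.
            seq.extend(proposal)
    return seq


def toy_target(seq: List[Token]) -> Token:
    """Hypothetical stand-in for the large model."""
    return (seq[-1] * 3 + 1) % 17


def toy_draft(seq: List[Token]) -> Token:
    """Hypothetical stand-in for the small model: usually agrees, sometimes guesses 0."""
    return 0 if seq[-1] % 5 == 0 else toy_target(seq)


fast = speculative_decode(toy_target, toy_draft, [1], 20)
slow = greedy_decode(toy_target, [1], 20)
```

In this sketch `fast` and `slow` are the same sequence by construction, since every token that survives is one the target model would have chosen itself. The draft only changes how many target checks can be grouped together.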

The architecture is split across two Gradio Spaces deliberately:

- This Space handles all AI inference: model loading, LoRA adapters, speculative decoding, code streaming
- A companion Space (anim-vid-api-mcp) handles all Manim rendering, keeping the Manim environment and its dependencies fully isolated

This keeps each concern independent and makes the renderer reusable as a standalone API by other apps.
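The hand-off between the two Spaces might look like the sketch below. The endpoint name `/render`, the helper `render_animation`, and the `EchoClient` stand-in are all hypothetical. The only thing assumed about the real client is a `predict(..., api_name=...)` method, which is how `gradio_client.Client` exposes Space endpoints.

```python
from typing import Any, Protocol


class SpaceClient(Protocol):
    """Anything with a gradio_client-style predict() method."""

    def predict(self, *args: Any, api_name: str) -> Any: ...


def render_animation(client: SpaceClient, manim_code: str,
                     api_name: str = "/render") -> Any:
    """Send generated code to the rendering Space and return its response.

    "/render" is a hypothetical endpoint name, not the Space's documented API.
    """
    if "class MyScene" not in manim_code:
        raise ValueError("expected a scene class named MyScene")
    return client.predict(manim_code, api_name=api_name)


class EchoClient:
    """Offline stand-in so the sketch runs without network access."""

    def predict(self, *args: Any, api_name: str) -> Any:
        return {"api_name": api_name, "code_chars": len(args[0])}


result = render_animation(EchoClient(), "class MyScene(Scene):\n    pass\n")
```

In real use the stand-in would be replaced by a `gradio_client.Client` pointed at the renderer Space, and `client.view_api()` would list the endpoint names it actually exposes. Because the inference side only depends on `predict`, any other app can reuse the renderer the same way.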

The Hard Part: Dataset Curation

This was the real challenge of the project — and the part that took the most engineering.

Public Manim code exists, but not at the scale or consistency a fine-tune actually needs. What's out there rarely targets a specific rendering environment, mixes syntax versions freely, and has no guarantee of running without errors.

To solve this, roughly 10,000 Manim Python scripts were synthetically generated using Claude Sonnet 4.5, seeded from topics collected during a prior Manim animation project. This gave full control over quality and consistency. Every script in the dataset:

- Follows ManimCE v0.19 syntax throughout
- Names the scene class `MyScene`
- Ends with `self.wait()`
- Covers a broad range of subjects: geometry, calculus, physics, algebra, algorithms, and more
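The structural rules in that list are easy to check mechanically. Below is a minimal sketch of such a check using Python's `ast` module. It is not the filter used to build the dataset: `check_script` and `SAMPLE_SCRIPT` are illustrative names, and "ends with `self.wait()`" is interpreted here as "the last statement of `construct()` is a `self.wait(...)` call", which is an assumption. The script is only parsed, never executed, so Manim does not need to be installed.

```python
import ast

# An illustrative script that follows the dataset conventions.
SAMPLE_SCRIPT = '''
from manim import *

class MyScene(Scene):
    def construct(self):
        circle = Circle()
        self.play(Create(circle))
        self.wait()
'''


def check_script(source: str) -> list[str]:
    """Return a list of convention violations (an empty list means the script passes)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    scenes = [n for n in tree.body
              if isinstance(n, ast.ClassDef) and n.name == "MyScene"]
    if not scenes:
        return ["no top-level class named MyScene"]

    construct = next((n for n in scenes[0].body
                      if isinstance(n, ast.FunctionDef) and n.name == "construct"), None)
    if construct is None:
        return ["MyScene has no construct() method"]

    # The final statement must be a call of the form self.wait(...).
    last = construct.body[-1]
    ends_with_wait = (
        isinstance(last, ast.Expr)
        and isinstance(last.value, ast.Call)
        and isinstance(last.value.func, ast.Attribute)
        and last.value.func.attr == "wait"
        and isinstance(last.value.func.value, ast.Name)
        and last.value.func.value.id == "self"
    )
    return [] if ends_with_wait else ["construct() does not end with self.wait()"]


problems = check_script(SAMPLE_SCRIPT)
```

A check like this catches structural drift but says nothing about whether the script renders. Confirming that still requires running it in the target environment.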

Both LoRA adapters — the 14B and the 0.6B — were trained on this identical dataset, which is what makes speculative decoding work well here. When both models are trained on the same distribution, the draft model's token proposals closely match what the main model would generate, leading to a high acceptance rate and real speed gains.
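To see why the acceptance rate matters so much, here is a small back-of-the-envelope sketch. It uses the standard simplifying assumption that each drafted token is accepted independently with the same probability, and it counts the one token the main model contributes itself in every round. The function name and the rates plugged in are illustrative, not measurements from this project.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per verification round of the main model.

    Assumes each of the draft_len proposed tokens is accepted independently
    with probability acceptance_rate, and that the main model always adds
    one token of its own (a correction or a bonus token).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


# Illustrative comparison of a poorly matched and a well matched draft model.
low_match = expected_tokens_per_step(0.5, 4)
high_match = expected_tokens_per_step(0.9, 4)
```

Under this model, a draft that is rarely accepted yields barely more than one token per main-model round, while a closely matched draft yields several. That is the gap that training both adapters on the same data is meant to close.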

What the LoRA Actually Fixed

Before fine-tuning, even with an explicit system prompt, base Qwen3 models produced unreliable Manim output. The LoRA consistently resolved all of these:

| Behaviour | Base Model (no LoRA) | Fine-Tuned with LoRA |
| --- | --- | --- |
| Output format | Sometimes returned plain text instead of code | Always returns a valid Python code block |
| Code structure | Ignored `MyScene` naming and `self.wait()` even when told explicitly | Consistently follows the required structure |
| Library compatibility | Used standard Manim APIs that conflict with the rendering Space | Generates code compatible with the rendering environment |
| Syntax version | Mixed deprecated methods and older ManimCE syntax | Stays within ManimCE v0.19 throughout |
| Axis configuration | Used `axis_color=` directly, throwing errors | Correctly uses `axis_config={'color': ...}` |
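The axis configuration row is the kind of mistake that can be spotted without rendering anything. The sketch below is an illustrative lint, not part of the project: `find_axis_color_calls` is a made-up helper, and the `BAD` and `GOOD` snippets are minimal examples of the two styles in the table. It only looks for `Axes(...)` calls written with that exact name.

```python
import ast


def find_axis_color_calls(source: str) -> list[int]:
    """Return line numbers of Axes(...) calls that pass axis_color= directly."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "Axes"
                and any(kw.arg == "axis_color" for kw in node.keywords)):
            hits.append(node.lineno)
    return sorted(hits)


# The pattern the base model produced, which errors in the rendering environment.
BAD = "axes = Axes(x_range=[-3, 3], axis_color=BLUE)\n"

# The pattern the fine-tuned model produces instead.
GOOD = 'axes = Axes(x_range=[-3, 3], axis_config={"color": BLUE})\n'

bad_lines = find_axis_color_calls(BAD)
good_lines = find_axis_color_calls(GOOD)
```

A lint like this can patch over one known failure. The fine-tune addresses the whole class of errors at generation time rather than one keyword at a time.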

The base model knew Manim. The fine-tune knew this Manim — for this renderer, this environment, this structure.

Training Setup

Both models were fine-tuned on Modal using an A100 GPU, completing in approximately 3 hours total.

| Parameter | Value |
| --- | --- |
| Epochs | `3` |
| Learning Rate | `6e-5` |
| LR Scheduler | `linear` |
| Warmup Ratio | `0.03` |
| Optimizer | `adamw_torch_fused` |
| Per Device Train Batch Size | `4` |
| Gradient Accumulation Steps | `32` |
| Effective Batch Size | `128` (4 × 32) |
| Validation Split | `10%` |
| Eval Steps | every `50` steps |
| Precision | `bf16` |
| Seed | `42` |
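For a sense of scale, the table can be turned into a rough step count. This is a sketch under stated assumptions: a single GPU, a dataset of exactly 10,000 scripts (the post says "roughly"), and partial batches rounded up. The field names mirror Hugging Face `TrainingArguments` for readability, but nothing in the post confirms which trainer was used. `validation_split` and `schedule` are illustrative names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values from the training table; validation_split is an illustrative field."""
    num_train_epochs: int = 3
    learning_rate: float = 6e-5
    lr_scheduler_type: str = "linear"
    warmup_ratio: float = 0.03
    optim: str = "adamw_torch_fused"
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 32
    eval_steps: int = 50
    bf16: bool = True
    seed: int = 42
    validation_split: float = 0.10

    @property
    def effective_batch_size(self) -> int:
        # Single-GPU assumption: no multiplication by device count.
        return self.per_device_train_batch_size * self.gradient_accumulation_steps


def schedule(cfg: TrainConfig, dataset_size: int) -> dict:
    """Approximate optimizer-step counts implied by the config."""
    train_examples = round(dataset_size * (1 - cfg.validation_split))
    steps_per_epoch = math.ceil(train_examples / cfg.effective_batch_size)
    total_steps = steps_per_epoch * cfg.num_train_epochs
    return {
        "train_examples": train_examples,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * cfg.warmup_ratio),
        "evaluations": total_steps // cfg.eval_steps,
    }


plan = schedule(TrainConfig(), dataset_size=10_000)
```

Under these assumptions the whole run is only a couple of hundred optimizer steps, so warmup covers just a handful of them and evaluation happens only a few times. Exact counts depend on the true dataset size and on how the trainer handles the final partial batch.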

Tech Stack

Main Model: unsloth/Qwen3-14B + ovi054/qwen3-14b-manim-lora
Draft Model: Qwen/Qwen3-0.6B + ovi054/qwen3-0.6b-manim-lora
Rendering Backend: anim-vid-api-mcp
Training Infrastructure: Modal (A100 GPU)
Frontend: Gradio

What I Learned

The synthetic dataset approach was the right call, and the lesson was that dataset quality beats dataset size. A clean, consistent 10k dataset trained both adapters reliably. Messy public data at 10x the scale would likely have produced a model that only half-works.

The two-Space architecture was also the right call. Keeping Manim's rendering environment isolated from the inference Space meant each side could be debugged, updated, and reused independently. The renderer is already usable as a standalone API — other apps can call it directly.

When you fine-tune both the main model and the draft model on the same data, speculative decoding stops being a gamble and starts being a genuine speed layer.

Try it here: Anim Vid AI on Hugging Face
