Typotopia
Building a generative type foundry: from a prompt to a working OpenType file.
What it is
Typotopia takes a single text prompt — or a control image — and returns a complete, installable OpenType font. No drawing, no kerning by hand, no font editor. One brief in, one .otf out, with real GPOS kerning baked in.
The trick is that "generating a font" is not one problem. It's three stacked problems that each fail in their own way: getting a diffusion model to draw consistent letterforms, turning pixels into clean vector outlines, and assembling those outlines into a file the OS will actually accept and space correctly. Most of the work was in the seams between them.
The model layer
The backbone is FLUX.2-klein (the 4B base) with three custom LoRA adapters, one per glyph category:
- typotopiaMAJ — uppercase
- typotopiaMIN — lowercase
- typotopiaPONCT — punctuation, digits, symbols
I trained all three with Ostris' ai-toolkit. Splitting the alphabet into three adapters instead of one was a deliberate decision: a single LoRA trying to hold caps, lowercase, and punctuation at once smeared the categories together — lowercase forms leaking proportions from the caps, punctuation drifting toward letter-shapes. Three narrower adapters each learned a tighter distribution, and the cost is only that inference runs three passes instead of one.
Each adapter draws a 6×6 grid — 36 cells, one glyph per cell, at 1536×1536. The grid layout is fixed and known ahead of time, which is what makes the next stage (cutting the grid back into individual glyphs) deterministic.
The hardest lesson at this layer was unglamorous: the LoRA only fires when its trigger word is in the prompt, and that's true regardless of guidance scale. I burned real time chasing "weak" outputs that were actually just the base model with the adapter sitting inert, because the trigger token was missing. Now every pass is prefixed — [typotopiaMAJ], … — and guidance is tuned per mode (lower for text prompts, higher when conditioning on a control image so the reference style actually takes).
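A guard like the following makes a missing trigger impossible rather than merely unlikely. This is a minimal sketch: the helper name `build_prompt`, the category keys, and the exact "[trigger], brief" layout are assumptions for illustration, not the project's actual code.

```python
# Hypothetical mapping from glyph category to LoRA trigger word.
# The adapter names come from the article; the dict keys are illustrative.
TRIGGERS = {
    "uppercase": "typotopiaMAJ",
    "lowercase": "typotopiaMIN",
    "punctuation": "typotopiaPONCT",
}


def build_prompt(category: str, brief: str) -> str:
    """Return the brief prefixed with the bracketed trigger for this category."""
    trigger = f"[{TRIGGERS[category]}]"
    brief = brief.strip()
    # Avoid double-prefixing if the caller already included the trigger.
    if brief.startswith(trigger):
        return brief
    return f"{trigger}, {brief}"


def build_all_prompts(brief: str) -> dict:
    """One prompt per adapter pass, each carrying its own trigger."""
    return {category: build_prompt(category, brief) for category in TRIGGERS}
```

Routing every pass through one function means an inert adapter can only come from a wrong category key, which fails loudly with a KeyError.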
Inference itself is short: 5 steps per pass. klein is fast enough that the bottleneck is never the diffusion — it's everything downstream.
The vectorization layer
A 6×6 image grid is not a font. Each cell has to become a clean vector contour.
The pipeline upscales each grid, slices it into 36 boxes by simple integer division, and thresholds each cell to a black mask. That mask goes through potrace to produce SVG paths, which fontTools then parses into glyph outlines.
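The slicing and thresholding steps can be sketched in pure Python on a grayscale image stored as a list of rows. The function names and the cutoff of 128 are assumptions; a real pipeline would more likely use an image library, but the arithmetic is the same.

```python
def slice_grid(pixels, rows=6, cols=6):
    """Cut a grayscale image (list of pixel rows) into rows*cols equal cells.

    Cell size comes from integer division, so any remainder pixels on the
    right and bottom edges are dropped. Cells are returned in reading order.
    """
    height, width = len(pixels), len(pixels[0])
    cell_h, cell_w = height // rows, width // cols
    cells = []
    for r in range(rows):
        for c in range(cols):
            top, left = r * cell_h, c * cell_w
            cell = [row[left:left + cell_w] for row in pixels[top:top + cell_h]]
            cells.append(cell)
    return cells


def threshold(cell, cutoff=128):
    """Return a boolean mask that is True where the pixel counts as ink.

    The cutoff is an assumed value; pixels darker than it are treated as ink.
    """
    return [[value < cutoff for value in row] for row in cell]
```

Because the grid geometry is fixed, the cell index alone identifies the glyph: cell `r * 6 + c` is always the same character for a given adapter.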
potrace's parameters matter more than they look:
- --turdsize 30 kills speckle noise that would otherwise become tiny stray contours inside a letter.
- --alphamax 0.9 and --opttolerance 0.5 control corner rounding vs. curve fitting: the difference between a crisp stem terminal and a melted one.
- Output coordinates get rounded to one decimal of precision. Past that, you're just storing diffusion noise as Bézier handles.
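As a rough sketch, the potrace call and the coordinate rounding might look like this. The flag values are the ones quoted above; the function names, file paths, and the choice to shell out through `subprocess` are assumptions.

```python
import subprocess


def potrace_command(bitmap_path, svg_path):
    """Build the potrace argument list with the settings described above."""
    return [
        "potrace", bitmap_path,
        "--svg",                   # emit SVG paths
        "--turdsize", "30",        # drop speckles up to 30 pixels in area
        "--alphamax", "0.9",       # corner threshold
        "--opttolerance", "0.5",   # curve optimization tolerance
        "--output", svg_path,
    ]


def trace(bitmap_path, svg_path):
    """Run potrace; raises CalledProcessError if potrace reports a failure."""
    subprocess.run(potrace_command(bitmap_path, svg_path), check=True)


def round_coords(points, digits=1):
    """Round outline coordinates to one decimal place by default."""
    return [(round(x, digits), round(y, digits)) for x, y in points]
```

potrace reads bitmap formats such as PBM or BMP, so the thresholded mask has to be saved in one of those before calling `trace`.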
The font-construction layer
This is where fontTools does the heavy lifting, and where most of the "it's a real font" work lives.
Baselines are computed, not assumed. Each grid gets its own baseline derived from reference glyphs that sit flat on the line (H, I, E for caps; n, m, u for lowercase; digits for the punctuation grid). A diffusion model does not draw all 36 cells on a shared baseline, so anchoring per-grid off known-flat references is what keeps the rendered text from looking like a ransom note.
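One way to derive a per-grid baseline is to take the median of the lowest outline point of each reference glyph. This is a sketch under assumptions: the article does not say which statistic is used, so the median, the function names, and the y-up coordinate convention are all illustrative.

```python
from statistics import median

# Reference glyphs named in the text. The digit set is assumed to be 0-9.
REFERENCES = {
    "caps": ("H", "I", "E"),
    "lower": ("n", "m", "u"),
    "punct": tuple("0123456789"),
}


def grid_baseline(bottoms, references):
    """Estimate a grid's baseline from glyphs known to sit flat on the line.

    `bottoms` maps glyph name to the lowest y of its outline (y grows upward).
    The median is an assumed choice that tolerates one badly drawn reference.
    """
    found = [bottoms[name] for name in references if name in bottoms]
    if not found:
        raise ValueError("no reference glyph available for this grid")
    return median(found)


def align_to_baseline(outlines, baseline):
    """Shift every outline in the grid so the estimated baseline lands on y = 0."""
    return {
        name: [(x, y - baseline) for x, y in points]
        for name, points in outlines.items()
    }
```

Glyphs that legitimately overshoot or descend (Q, J, g, p) are ignored by the estimate but still get shifted by it, which keeps them in the right relation to the line.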
Coverage comes from composition, not generation. Rather than asking the model to draw every accented form, the builder composes them:
- Mirrors: ) ] } are horizontal flips of ( [ {. The closing glyph of each bracket pair is derived from its opening glyph at build time.
- Accents: é, à, ô, ñ etc. are assembled by stacking a base glyph and a diacritic, positioned by the base glyph's bounds plus a fixed offset.
- Combining-mark aliases: the standalone accent glyphs are also mapped to their combining codepoints, so the font behaves for both precomposed and decomposed input.
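The mirror and accent steps reduce to simple coordinate transforms. The sketch below works on outlines simplified to point lists; the function names, the horizontal centering of the accent, and the default gap of 50 units are assumptions, since the article only says the accent is placed from the base glyph's bounds plus a fixed offset.

```python
def bounds(points):
    """Return (x_min, y_min, x_max, y_max) for a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def mirror_horizontally(points, advance):
    """Flip an outline left-to-right inside its advance width.

    A flip reverses the winding direction, so the point order is reversed
    as well to keep the contour orientation consistent.
    """
    return [(advance - x, y) for x, y in reversed(points)]


def place_accent(base, mark, gap=50):
    """Move `mark` so it sits centered above `base`.

    The mark's bottom edge lands `gap` units above the base's top edge.
    Centering and the gap value are assumed, not taken from the project.
    """
    bx0, _, bx1, by1 = bounds(base)
    mx0, my0, mx1, _ = bounds(mark)
    dx = (bx0 + bx1) / 2 - (mx0 + mx1) / 2
    dy = by1 + gap - my0
    return [(x + dx, y + dy) for x, y in mark]
```

Composing é then amounts to concatenating the contours of e with the output of `place_accent(e, acute)`; below-base marks such as the cedilla would need the mirrored rule measured from the base's bottom edge.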
This is also where French/AZERTY support lives — the mirror and alias logic is what makes the output usable on my own keyboard without re-prompting the model for a dozen edge-case glyphs.
Reference metrics (ascent, descent, x-height, cap-height, line gap) are borrowed from Helvetica LT Std Regular. Not because the output looks like Helvetica, but because you need some sane vertical metrics for the OS to lay text out, and a known-good grotesque is a safe donor.
Bubble kerning
Auto-kerning was the most satisfying piece to get right, and it went through the most iterations.
The naive approach — sidebearings only — leaves AV, To, Wa looking gapped. Proper kerning needs to know how close two outlines actually get when set side by side. So:
- Sample each glyph's contour: 300 points along the Bézier curves, in UPM coordinates. Critically, the scan covers the full vertical profile, top and bottom (Y_MIN = -200 to Y_MAX = 800). I learned the hard way that scanning only one band breaks italics and diagonals, where the tightest point can be anywhere.
- For every glyph pair, shift the right glyph by the left glyph's advance and compute the minimum Euclidean distance between all point pairs.
- The kern is diameter − min_dist, where diameter is twice a fixed BUBBLE_RADIUS of 10 UPM. Imagine rolling a small bubble between the two letters and pulling them together until it's pinched.
- Pairs that don't pinch (kern above a small threshold) are skipped as noise; a floor of −280 prevents collisions.
- The pairs are written to GPOS, LookupType 2 (PairPos), deliberately not the legacy kern format-0 table. That table's subtable length field is only 16 bits, which caps a subtable at roughly 10,900 pairs, and an all-to-all kerning matrix blows past that quickly. GPOS can hold far more.
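The core of the bubble computation fits in a few lines. This sketch assumes each glyph has already been sampled into a list of (x, y) points in font units; BUBBLE_RADIUS and the −280 floor come from the description above, while the function names and the NOISE_THRESHOLD value of −5 are assumptions.

```python
import math

BUBBLE_RADIUS = 10     # from the article, in font units
KERN_FLOOR = -280      # from the article: never pull a pair closer than this
NOISE_THRESHOLD = -5   # assumed value: kerns above this are dropped as noise


def min_distance(left_points, right_points, left_advance):
    """Smallest Euclidean distance between two sampled outlines.

    The right glyph is shifted by the left glyph's advance so both sets of
    points share one coordinate system, as if set side by side.
    """
    best = math.inf
    for lx, ly in left_points:
        for rx, ry in right_points:
            distance = math.hypot(rx + left_advance - lx, ry - ly)
            best = min(best, distance)
    return best


def bubble_kern(left_points, right_points, left_advance):
    """Return the kern for one pair, or None if the pair should be skipped."""
    diameter = 2 * BUBBLE_RADIUS
    kern = diameter - min_distance(left_points, right_points, left_advance)
    if kern > NOISE_THRESHOLD:
        return None  # already about one bubble apart (or closer): leave it alone
    return max(round(kern), KERN_FLOOR)
```

The brute-force inner loop compares every point pair: 300 × 300 distance checks per glyph pair. That cost is multiplied by the number of glyph pairs, so a vectorized or spatially indexed version is worth considering for large glyph sets.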
And there you go! You can generate as many fonts as you want, thanks to Hugging Face Spaces running on ZeroGPU.
LoRAs: https://huggingface.co/ChevalierJoseph/TYPOTOPIA_APP