Would you be willing to share information on the training?

#12

pinned

by mosthu - opened 5 days ago

Hello, I am interested in fine-tuning Z Image / Z Anime on images with specific styles and content. I am not finding much information online about how to approach this. Would you be willing to share information on Z Anime's training process (training parameters, dataset size, input image resolution, etc.)?

Either way, thank you for sharing this model with the world.

SeeSee21

Owner 4 days ago

Hey mosthu! 👋

Happy to share my experience — but a heads-up first: there's no single "best" set of training parameters. There are rules of thumb and rough values you can lean on, but if you're planning to train your own model, a lot of it will come down to testing.

A bit of preface 🧠

I've always been fascinated by how a model actually learns. How does it know a tree is green, or that the sky is blue during the day, or what a house looks like — and why does it know a house needs windows and doors?

The answer is simple and complex at the same time: it learns from the images and, through the captions, from what's supposed to be visible in them. That's why the descriptions for your dataset have to match the images precisely. More on that later.

My journey before Z-Anime

Chroma was my first real shot:
https://civitai.red/models/2022057/chroma-anime-aio

That one was a checkpoint merge — I trained a LoRA on an FP8 Chroma checkpoint using OneTrainer with the default config and ~1,500 images, then merged it in ComfyUI. I like having everything baked into one checkpoint, so I also wrote a script that merges the text encoder and VAE into a single file. Lucky for me, ComfyUI loaded it without any adjustments.

I tried a V2 a few times, but the bigger my dataset got, the more it forgot what it had previously learned. What I didn't know back then was that you can actively counter this — keywords: learning rate and EMA.
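For readers who haven't met EMA before, here is a minimal sketch of the generic idea: keep a slowly moving average of the weights next to the "live" weights, so that a burst of new data can't overwrite everything at once. This is the textbook formula, not OneTrainer's actual implementation. The decay of 0.999 and the update interval of 5 steps are simply the values that appear in the config further down (where EMA happens to be switched OFF), and the single scalar "weight" is a toy stand-in for real tensors.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """Blend the current weights into the running average (in place)."""
    for name, value in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value
    return ema_weights


# Toy run: one scalar "weight" that the optimizer pushes from 0.0 to 1.0.
weights = {"w": 0.0}
ema = dict(weights)           # EMA copy starts equal to the model
update_interval = 5           # assumption: mirrors "ema_update_step_interval"

for step in range(1, 101):
    weights["w"] = 1.0        # pretend the optimizer moved the weight
    if step % update_interval == 0:
        ema_update(ema, weights, decay=0.999)

# After 100 steps (20 EMA updates) the average has barely moved,
# which is exactly the "don't forget too fast" behaviour.
print(round(ema["w"], 4))
```

A higher decay makes the averaged model change more slowly; a lower decay lets it follow the live weights more closely.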

Next stop — Qwen Image, first proper fine-tune:
https://civitai.red/models/2135240/qwen-anime-official-workflow?modelVersionId=2540517

OneTrainer again, but tweaked for my server hardware: 2× NVIDIA Tesla P40 cards with 24GB VRAM each, and 512GB DDR4 RAM. Sounds like a lot? Depends how patient you are. You can absolutely train models this size with less VRAM; I even got it to start training on my gaming PC with an old 4060 Ti / 8GB VRAM 😅. Just be ready to wait.

Back to Qwen: V1 and V2 were trained on ~5K images. V3 added another ~5K image-to-image samples to keep the edit functionality alive, so ~10K total for V3.

My OneTrainer workflow

I always set it to 100 epochs and just watch the training. When the metrics feel right and I think it's ready to test, I stop and run 100 standard prompts in ComfyUI, look at the results, and evaluate them. A lot of this is personal preference — how I want the images to look.

Style not there yet? → Train more.
Fingers/faces still off? → Train more.
Anatomy problems or weird artifacts? → Something's off with your settings (LR etc.) — adjust.

The most important rule: you want a model that generates, not copies ⚠️. Some models get really good at copying — if it can reproduce training images one-to-one, you've already trained too far and the model is losing variety. At that point I either change parameters or roll back to an earlier saved checkpoint.
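One rough way to spot copying automatically is to compare a perceptual hash of each generated image against the training set. The sketch below uses a simple average hash on tiny grayscale grids. It is an illustration only: the function names and the distance threshold are made up for this example, and in practice you would first load and downscale real images (for instance to 8×8 grayscale) with an image library before hashing.

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values, already downscaled (e.g. 8x8)."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming(a, b):
    """Number of positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def looks_copied(generated, training_set, max_distance=4):
    """True if the generated image is near-identical to any training image."""
    g_hash = average_hash(generated)
    return any(
        hamming(g_hash, average_hash(train)) <= max_distance
        for train in training_set
    )


# Toy 2x2 "images": dark top row, bright bottom row.
training = [[[0, 0], [255, 255]]]
near_copy = [[1, 2], [250, 251]]      # same layout, slightly different values
different = [[255, 0], [0, 255]]      # different layout

print(looks_copied(near_copy, training, max_distance=0))
print(looks_copied(different, training, max_distance=0))
```

A hash match is only a hint. Looking at the flagged pairs by eye is still the real test.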

Z-Image

https://huggingface.co/Tongyi-MAI/Z-Image-Turbo

Up front: forget trying to train the Turbo version directly — it won't work. What does work is training a solid base LoRA, e.g. with AI-Toolkit:
https://github.com/ostris/ai-toolkit

Lots of example videos there showing how to train and what works on which hardware.

Same rules: more images → smaller learning rate (but not too small), LoRA rank has to be adjusted, and step count too. For fine-tuning, a model should see each image at least 50 times — ideally 100. So you scale steps, batch size, etc. accordingly. And again: generation, not copying.
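The "each image seen 50 to 100 times" rule of thumb translates into a step count with simple arithmetic. A small sketch, assuming one optimizer step consumes batch_size × gradient_accumulation samples and ignoring multi-GPU setups; the function name is made up for this example.

```python
import math


def steps_for_exposures(num_images, exposures_per_image, batch_size, grad_accum=1):
    """Optimizer steps needed so every image is seen `exposures_per_image` times."""
    total_samples = num_images * exposures_per_image
    samples_per_step = batch_size * grad_accum
    return math.ceil(total_samples / samples_per_step)


# 10K images at batch size 4 (illustrative numbers):
print(steps_for_exposures(10_000, 50, batch_size=4))    # 125000
print(steps_for_exposures(10_000, 100, batch_size=4))   # 250000
```

Doubling the batch size halves the step count for the same number of exposures, which is why steps and batch size have to be scaled together.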

That's exactly what I did here:
https://civitai.red/models/2259646/z-image-turbo-anime

Lots of LoRAs, lots of testing until I had something I liked. 10K anime images, 50K steps. I don't remember 100% if I used the very last checkpoint — I think it was one before. I save every 500 steps.

Bonus tip: there are scripts that let you adjust the strength of specific regions within a LoRA — background look, anatomy, etc. I did exactly that, and that's how Z-Image-Turbo-Anime came out.
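The basic idea behind such scripts can be sketched like this: a LoRA's contribution to a layer is the product of its "up" and "down" matrices, so multiplying only the "up" matrix by a factor scales that layer's effect by the same factor. The key names below ("blocks.0.", "lora_up.weight") are hypothetical placeholders, real LoRA files use naming that depends on the trainer, and plain nested lists stand in for tensors.

```python
def scale_lora_blocks(state, block_strengths, default=1.0):
    """Return a copy of `state` with each block's up-projection scaled.

    state:           dict of key -> 2D list (stand-in for tensors)
    block_strengths: dict of key prefix -> strength multiplier
    """
    scaled = {}
    for key, matrix in state.items():
        strength = default
        for prefix, value in block_strengths.items():
            if key.startswith(prefix):
                strength = value
                break
        if key.endswith("lora_up.weight"):
            # Scaling only the up matrix scales the whole up @ down product.
            scaled[key] = [[v * strength for v in row] for row in matrix]
        else:
            scaled[key] = [row[:] for row in matrix]
    return scaled


# Hypothetical two-block LoRA: weaken block 0, leave block 1 untouched.
lora = {
    "blocks.0.attn.lora_up.weight": [[1.0, 2.0]],
    "blocks.0.attn.lora_down.weight": [[3.0]],
    "blocks.1.attn.lora_up.weight": [[4.0]],
}
adjusted = scale_lora_blocks(lora, {"blocks.0.": 0.5})
print(adjusted["blocks.0.attn.lora_up.weight"])   # [[0.5, 1.0]]
```

Which blocks affect background, anatomy, and so on is something you find out by testing; the sketch only shows the mechanics.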

Now, Z-Anime 🎯

How did I train it? OneTrainer, full fine-tune, not a LoRA merge. I run a slightly modified version of OneTrainer so I get more logs for evaluation.

I started with 500 images, trained 3 epochs at a time, then compared results. For the automated eval and restart logic I used autoresearch:
https://github.com/karpathy/autoresearch

It auto-stopped training and restarted with different values. The one thing I locked in was sticking with CONSTANT for the loss weighting — all checkpoints trained with that. Why change something that's been evaluated as good?

Here's the config (not 100% sure this is the very latest, but it's close):

{
    "learning_rate": 1e-05,
    "learning_rate_warmup_steps": 1000.0,
    "learning_rate_cycles": 1.0,
    "learning_rate_min_factor": 0.0,
    "epochs": 100,
    "batch_size": 4,
    "gradient_accumulation_steps": 1,
    "ema": "OFF",
    "ema_decay": 0.999,
    "ema_update_step_interval": 5,
    "dataloader_threads": 1,
    "train_device": "cuda",
    "temp_device": "cpu",
    "train_dtype": "BFLOAT_16",
    "fallback_train_dtype": "BFLOAT_16",
    "enable_autocast_cache": true,
    "only_cache": false,
    "resolution": "768",
    "frames": "25",
    "mse_strength": 1.0,
    "mae_strength": 0.0,
    "log_cosh_strength": 0.0,
    "huber_strength": 0.0,
    "huber_delta": 1.0,
    "vb_loss_strength": 1.0,
    "loss_weight_fn": "CONSTANT",
    "loss_weight_strength": 5.0
}
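To make the learning-rate part of that config more concrete, here is a minimal sketch of what a 1,000-step warmup up to 1e-5 looks like. The scheduler type is not part of the excerpt above, so the linear ramp followed by a constant rate is an assumption for illustration, not necessarily what was used for Z-Anime.

```python
def lr_at_step(step, base_lr=1e-5, warmup_steps=1000):
    """Linear warmup to `base_lr`, then constant (assumed schedule)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


for step in (0, 499, 999, 5000):
    print(step, lr_at_step(step))
```

The warmup keeps the first updates small, so the pretrained weights aren't hit with the full learning rate before the optimizer statistics have settled.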

Dataset: 15K images at 768 resolution. Why not 1024? Because I didn't want to blow up my hardware — VRAM was already at the limit. I'd rather take a bigger batch size than higher resolution.

For Z-Anime I ended up picking epoch 35 as my base.

The 4-step and 8-step versions

Those are LoRA merges using pre-existing 4-step and 8-step LoRAs for Z-Anime. I deliberately didn't call them "Turbo" — other people will release versions like that too, and "Distill" makes it clearer to most folks what they're actually getting.

Bonus: building an 8-step from scratch 🔬

Can you make an 8-step version without a LoRA merge? Yes, if you understand how the Turbo version was created. The Turbo versions are trained with DMD2:
https://github.com/Tongyi-MAI/Z-Image/issues/56

You take your trained base and build your own DMD2 trainer — about 2,000 lines of code. Pretty doable, and an AI like ChatGPT or Claude can absolutely help.

That said, today I wouldn't go that route anymore. There's already a much better turbo-style method out — just no model trained with it yet because it's too new: CMD
https://github.com/byliutao/cdm

Much better, and even works in 4 steps. I'm running first tests with Anisee, an unreleased checkpoint:
https://anisee.anisee.workers.dev/

I needed a small base, so I did an Anima fine-tune with the same 15K dataset, and that's where I'm testing CMD before bringing it to Z-Anime — Anisee is much smaller, so iteration is faster. Once the trainer is done, the actual training will probably take 2+ weeks.

The dataset — the most important part 📦

Golden rule: quality over quantity — and yet you still need a lot of images. For comparison: a 2B model from scratch typically needs 500K to 2 million images. Z-Image is a 6B model. So in raw numbers, 15K isn't much for the whole model — honestly, it's too little if you want to cover multiple styles.

Diversity matters a lot. Simple example:

You have one pose, same character, same background, same camera angle, repeated → that's bad. Way better: that pose with 10 different characters, 10 different camera angles, 10 different backgrounds. Now you're not just training the pose, you're also training characters, angles, and backgrounds. And that's just a tiny part — you can broaden your captions much further: clothing, era metadata like "90s style", and so on.
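If your dataset has per-image metadata, a quick count shows where diversity is missing. A minimal sketch, assuming each image has a record with "pose", "character", "angle", and "background" fields; these field names and the sample records are hypothetical.

```python
from collections import defaultdict


def diversity_by_pose(records, fields=("character", "angle", "background")):
    """Count distinct values of each field for every pose in the dataset."""
    seen = defaultdict(lambda: {field: set() for field in fields})
    for record in records:
        for field in fields:
            seen[record["pose"]][field].add(record[field])
    return {
        pose: {field: len(values) for field, values in per_field.items()}
        for pose, per_field in seen.items()
    }


records = [
    {"pose": "sitting", "character": "A", "angle": "front", "background": "room"},
    {"pose": "sitting", "character": "A", "angle": "front", "background": "room"},
    {"pose": "running", "character": "A", "angle": "side", "background": "city"},
    {"pose": "running", "character": "B", "angle": "above", "background": "forest"},
]
print(diversity_by_pose(records))
```

Here "sitting" shows a count of 1 everywhere, which is the repeated-image case described above, while "running" already varies in all three fields.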

The same rule as before applies: if it's not in the image, don't describe it. That also goes for emotions or what might be happening: the model doesn't reason about what could happen, it only ever sees the image. But don't leave out things that are visible, either.

Back to the house example. If your captions say "house with windows" 10 times but the house in the image has no windows, the model will only ever generate houses without windows — no matter how often you write "windows" in the prompt.

Another one: 10 houses with windows, 10 without, but your captions don't mention windows at all → the model gets confused during training. Results become unpredictable: sometimes broken windows, sometimes no windows, sometimes both 😅. Simple examples, but the point is: captions are the biggest part of training. There are plenty of tools and configs with good training settings now, but your dataset has to be right.

Forget auto-captioning with AI ❌

Forget using an AI to write image descriptions out of the box. Most models just hallucinate without a very precise system prompt + function call setup, and even the big ones from OpenAI or Google can't accurately describe what's actually in the image. If you want your model to render text later, you can use them as OCR, but for the rest I'd do it all manually.

I wrote a small program with Claude that has 10–15 categories with standard sentences and variations. That way I can see what's still missing from a description and what's already there. That's how I caption everything.
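The core of such a checklist tool can be very small. The sketch below is not the actual program, just an illustration of the idea: a handful of categories with keyword lists, and a function that reports which categories a caption doesn't cover yet. The categories, keywords, and naive substring matching are all assumptions for the example.

```python
# Hypothetical categories and keywords; a real list would be much longer.
CATEGORIES = {
    "subject": ["girl", "boy", "woman", "man", "cat"],
    "pose": ["standing", "sitting", "running"],
    "camera": ["close-up", "from above", "from below", "full body"],
    "background": ["forest", "city", "room", "sky", "plain background"],
    "clothing": ["dress", "uniform", "jacket"],
    "style": ["90s style", "watercolor"],
}


def missing_categories(caption, categories=CATEGORIES):
    """Return the category names that no keyword in the caption covers."""
    text = caption.lower()
    return [
        name
        for name, keywords in categories.items()
        if not any(keyword in text for keyword in keywords)
    ]


caption = "A girl sitting in a forest, full body, 90s style"
print(missing_categories(caption))   # ['clothing']
```

Substring matching is crude ("room" also matches "mushroom"), so treat the output as a reminder of what to look at, not as a verdict.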

TL;DR / Summary 📝

No magic settings. There are rules of thumb, but your dataset, target style, and base model decide the rest.
Goal: generation, not copying. If your model reproduces training images 1:1, you've overtrained — roll back or adjust.
Z-Anime config: OneTrainer full fine-tune, 15K images @ 768, LR 1e-5, batch 4, BF16, CONSTANT loss weighting, epoch 35 picked as base.
4/8-step versions: LoRA merges using DMD2-trained LoRAs. CMD is the newer, better path forward.
Hardware: 2× P40 24GB + 512GB RAM — but you can train on far less if you have patience.
Captions = the biggest lever. Dataset diversity > raw image count. Don't describe what isn't there, don't leave out what is.
Skip auto-captioners. Build a structured manual workflow instead.

What's next on my end

I think I'm done with Z-Anime for now — V2 not decided yet. Current dataset sits at 65K captioned images and growing.

Projects in flight:

Anisee (testing CMD on it first)
CMD-based turbo training
A fully from-scratch model using Qwen 3.5 base as text encoder and the Qwen Image Edit VAE — testing image-to-image and OCR-to-image there.

Closing word 🙌

So as you can see — there's no single "best" setting. There are recommendations and rules of thumb you can lean on, but what you actually dial in is determined by your dataset, what you want to train, how much of it, and which base model you start from. Hope this helped you a bit — and anyone else reading along. Good luck with your fine-tune!

SeeSee21 pinned discussion 4 days ago

mosthu

3 days ago

This helps a lot, thank you so much!
