mdlm-owt-trash — trash-prefix control MDLM, 100k steps

Positional-control baseline for DIFF_1 from the quentin-dlm cascade: a masked-diffusion LM finetuned from kuleshov-group/mdlm-owt to generate OpenWebText documents after a constant filler prefix.

Layout: [summary 256 | text 768], sequence length 1024. The summary slot holds a constant prefix, [BOS] + [PAD]x254 + [EOS], which is always revealed (never masked); the masked-CE NELBO is computed on the text region only. time_conditioning=False.
Backbone: vendored Duo DiT, 169.6M parameters; GPT-2 tokenizer; vocab 50258 ([MASK]=50257, pad=eos=50256).
Data: EER6/openwebtext-coarse (doc_idx >= 2048; first 2048 held out).
Recipe: 100k steps, global batch 384 (8x GH200 DDP), lr 3e-4 cosine (warmup 500), AdamW(0.9, 0.95), wd 0, bf16, EMA 0.99.
These are the EMA weights of checkpoint-100000 (DiT backbone state_dict, same layout as mdlm-owt: model.safetensors at repo root).
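
A minimal sketch of the layout and masking rule described above, not the project's training code. It assumes [BOS] shares id 50256 with pad/eos (as in the stock GPT-2 tokenizer) and that each text token is masked independently with some probability `mask_prob`; the function names are hypothetical.

```python
import random

SEQ_LEN, PREFIX_LEN, TEXT_LEN = 1024, 256, 768
MASK_ID = 50257
PAD_ID = EOS_ID = 50256
BOS_ID = 50256  # assumption: GPT-2 uses one id for bos/eos; the card only states pad=eos


def build_prefix():
    """Constant filler prefix: [BOS] + [PAD] x 254 + [EOS] (256 tokens)."""
    prefix = [BOS_ID] + [PAD_ID] * (PREFIX_LEN - 2) + [EOS_ID]
    assert len(prefix) == PREFIX_LEN
    return prefix


def corrupt(text_ids, mask_prob, rng):
    """Return (inputs, loss_mask) for one 1024-token training example.

    The prefix is copied through untouched; only text tokens may be
    replaced by [MASK], and only masked text positions enter the loss.
    """
    if len(text_ids) != TEXT_LEN:
        raise ValueError(f"expected {TEXT_LEN} text tokens, got {len(text_ids)}")
    inputs = build_prefix()
    loss_mask = [False] * PREFIX_LEN  # prefix never contributes to the loss
    for tok in text_ids:
        masked = rng.random() < mask_prob
        inputs.append(MASK_ID if masked else tok)
        loss_mask.append(masked)
    return inputs, loss_mask
```

Calling `corrupt(text_ids, 0.5, random.Random(0))` yields a 1024-token input whose first 256 positions are always the same filler, which is what makes this model a positional control: the text region sits at the same offsets as in DIFF_1, but the prefix carries no information.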

Results / caveats: held-out val NELBO 3.293 (ppl 26.9) vs DIFF_1's 2.996 (ppl 20.0). By design, this model's generations show no summary dependence (content-word overlap ratio 1.0x vs DIFF_1's 5.5x), so the model isolates the positional handicap. NOTE: the hot 100k-step finetune degraded sampling fluency (gen-PPL ~207 at 512 sampling steps vs ~59 for the base model). See RESULTS_MDLM_100K.md in the project repo for the full diagnosis; earlier checkpoints sample better, and remasking samplers are recommended.
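
As a quick sanity check on the numbers above, the reported perplexities match exp(NELBO) if the NELBO is read as nats per token (an assumption; the card does not state the unit). The helper name is hypothetical.

```python
import math


def perplexity(nelbo_nats_per_token):
    """Perplexity bound implied by a per-token NELBO measured in nats."""
    return math.exp(nelbo_nats_per_token)


control, diff_1 = 3.293, 2.996
print(round(perplexity(control), 1))  # 26.9
print(round(perplexity(diff_1), 1))   # 20.0

# The gap between the control and DIFF_1, in nats and as a perplexity ratio.
gap = control - diff_1
print(round(gap, 3), round(math.exp(gap), 2))  # 0.297 1.35
```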

Load (project code): duo_core.load_model("EER6/mdlm-owt-trash", 1024, 50258, device), or pass --init_ckpt EER6/mdlm-owt-trash to train/train_big_mdlm.py.
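
For inspecting the raw weights without the project code, a sketch along these lines may work. It assumes `huggingface_hub`, `safetensors`, and `torch` are installed, and relies only on the card's statement that model.safetensors sits at the repo root. It returns a plain state_dict and does not build the DiT module; `duo_core.load_model` remains the intended loader. Both function names are hypothetical.

```python
import math


def load_backbone_state_dict(repo_id="EER6/mdlm-owt-trash",
                             filename="model.safetensors"):
    """Download the checkpoint file and return it as a name -> tensor dict."""
    # Imported lazily so the helper below works without these packages.
    from huggingface_hub import hf_hub_download
    from safetensors.torch import load_file

    path = hf_hub_download(repo_id=repo_id, filename=filename)
    return load_file(path)


def count_parameters(state_dict):
    """Total number of elements across all tensors in a state_dict."""
    return sum(math.prod(t.shape) for t in state_dict.values())
```

Running `count_parameters(load_backbone_state_dict())` gives a rough check against the stated backbone size; the total may differ slightly if the file also stores non-trainable buffers.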

Companion control: EER6/mdlm-owt-trash.
