mdlm-owt-trash: trash-prefix control MDLM, 100k steps
Control for DIFF_1 from the quentin-dlm cascade: a masked-diffusion LM finetuned from kuleshov-group/mdlm-owt to generate OpenWebText documents after a constant filler prefix (positional-control baseline for DIFF_1).
- Layout: [summary 256 | text 768] @ L1024; prefix = constant [BOS]+[PAD]x254+[EOS], always revealed (never masked); masked-CE NELBO on the text region only. time_conditioning=False.
- 169.6M vendored Duo DiT backbone, GPT-2 tokenizer, vocab 50258 ([MASK]=50257, pad=eos=50256).
- Data: EER6/openwebtext-coarse (doc_idx >= 2048; first 2048 held out).
- Recipe: 100k steps, global batch 384 (8x GH200 DDP), lr 3e-4 cosine (warmup 500), AdamW(0.9, 0.95), wd 0, bf16, EMA 0.99.
- These are the EMA weights of checkpoint-100000 (DiT backbone state_dict, same layout as mdlm-owt: model.safetensors at repo root).
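The layout above can be illustrated with a minimal sketch, not the project's actual data pipeline. It builds the constant 256-token prefix and masks only the 768 text positions. The [BOS] id is an assumption (the card does not state it; GPT-2 conventionally reuses 50256), `build_prefix` and `corrupt` are hypothetical helper names, and a single fixed mask probability stands in for the real noise schedule and per-timestep loss weighting.

```python
import random

# Constants taken from the card.
SEQ_LEN = 1024
PREFIX_LEN = 256
TEXT_LEN = SEQ_LEN - PREFIX_LEN  # 768
MASK_ID = 50257
PAD_ID = 50256  # pad = eos = 50256
EOS_ID = 50256
# Assumption: the card does not give the [BOS] id; GPT-2 usually reuses 50256.
BOS_ID = 50256


def build_prefix():
    """Constant filler prefix: [BOS] + [PAD] x 254 + [EOS] (256 tokens)."""
    return [BOS_ID] + [PAD_ID] * (PREFIX_LEN - 2) + [EOS_ID]


def corrupt(text_ids, mask_prob, rng):
    """Return (inputs, loss_mask) for one example.

    Only text positions may be replaced by [MASK]; the prefix stays revealed.
    loss_mask[i] is True where the masked cross-entropy term would be evaluated.
    """
    if len(text_ids) != TEXT_LEN:
        raise ValueError(f"expected {TEXT_LEN} text tokens, got {len(text_ids)}")
    inputs = build_prefix()
    loss_mask = [False] * PREFIX_LEN
    for tok in text_ids:
        masked = rng.random() < mask_prob
        inputs.append(MASK_ID if masked else tok)
        loss_mask.append(masked)
    return inputs, loss_mask


# Usage: random token ids stand in for a tokenized document.
rng = random.Random(0)
text = [rng.randrange(MASK_ID) for _ in range(TEXT_LEN)]
inputs, loss_mask = corrupt(text, 0.5, rng)
```

Because the prefix never enters the loss mask, it contributes no gradient signal of its own; it only shifts the text region to positions 256 through 1023.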
Results / caveats: held-out val NELBO 3.293 (ppl 26.9) vs DIFF_1's 2.996 (20.0). By design, this model's generations show no summary dependence (content-word overlap ratio 1.0x vs DIFF_1's 5.5x); it isolates the positional handicap. NOTE: the hot 100k finetune degraded sampling fluency (gen-PPL ~207 @512 steps vs ~59 for the base model); see RESULTS_MDLM_100K.md in the project repo for the full diagnosis (earlier checkpoints sample better; remasking samplers recommended).
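The perplexities quoted above are consistent with ppl = exp(NELBO), assuming the NELBO is reported in nats per token (the card does not state the unit). A quick check, with `nelbo_to_ppl` as a hypothetical helper name:

```python
import math


def nelbo_to_ppl(nelbo_nats_per_token):
    """Perplexity bound implied by a per-token NELBO given in nats."""
    return math.exp(nelbo_nats_per_token)


control_ppl = nelbo_to_ppl(3.293)  # this model
diff1_ppl = nelbo_to_ppl(2.996)    # DIFF_1
gap_nats = 3.293 - 2.996           # positional handicap in nats per token

print(f"{control_ppl:.1f} {diff1_ppl:.1f} {gap_nats:.3f}")  # 26.9 20.0 0.297
```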
Load (project code): duo_core.load_model("EER6/mdlm-owt-trash", 1024, 50258, device)
or as --init_ckpt EER6/mdlm-owt-trash in train/train_big_mdlm.py.
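Outside the project code, the raw state_dict can be inspected directly. The sketch below is an assumption-laden alternative to `duo_core.load_model`, not a replacement for it: it uses `hf_hub_download` from huggingface_hub and `load_file` from safetensors.torch to fetch model.safetensors and tally tensor sizes. `read_shapes` and `count_elements` are hypothetical helper names, and the total need not match 169.6M exactly if the file also stores buffers or shares weights.

```python
import math


def count_elements(shapes):
    """Sum element counts over a mapping of tensor name -> shape tuple."""
    return sum(math.prod(shape) for shape in shapes.values())


def read_shapes(repo_id="EER6/mdlm-owt-trash", filename="model.safetensors"):
    """Download the checkpoint and return {tensor name: shape}.

    Imports are local so the helper above works without these packages.
    Requires network access plus huggingface_hub, safetensors, and torch.
    """
    from huggingface_hub import hf_hub_download
    from safetensors.torch import load_file

    path = hf_hub_download(repo_id=repo_id, filename=filename)
    state = load_file(path)  # dict of tensor name -> torch.Tensor on CPU
    return {name: tuple(t.shape) for name, t in state.items()}


# Usage (needs network access):
#   shapes = read_shapes()
#   print(f"{count_elements(shapes) / 1e6:.1f}M elements")
```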
Companion control: EER6/mdlm-owt-trash.