mdlm-owt-diff1: summary-conditioned MDLM (DIFF_1), 100k steps
DIFF_1 from the quentin-dlm cascade: a masked-diffusion LM finetuned from kuleshov-group/mdlm-owt to generate OpenWebText documents conditioned on a coarse summary prefix.
- Layout: [summary 256 | text 768] @ L1024; prefix always revealed (never masked); masked-CE NELBO on the text region only. time_conditioning=False.
- 169.6M vendored Duo DiT backbone, GPT-2 tokenizer, vocab 50258 ([MASK]=50257, pad=eos=50256).
- Data: EER6/openwebtext-coarse (doc_idx >= 2048; first 2048 held out).
- Recipe: 100k steps, global batch 384 (8x GH200 DDP), lr 3e-4 cosine (warmup 500), AdamW(0.9, 0.95), wd 0, bf16, EMA 0.99.
- These are the EMA weights of checkpoint-100000 (DiT backbone state_dict, same layout as mdlm-owt: model.safetensors at repo root).
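The sequence layout described in this list can be illustrated with a minimal pure-Python sketch. It is not the project's training code: the helper names `build_sequence` and `mask_text_region` are hypothetical, and right-padding each region with the pad/eos id is an assumption. Only the region widths and token ids come from the card.

```python
import random

SEQ_LEN = 1024      # total sequence length (256 summary + 768 text)
PREFIX_LEN = 256    # summary slots, always revealed
MASK_ID = 50257     # [MASK] id from the card
PAD_ID = 50256      # pad = eos id from the card


def build_sequence(summary_ids, text_ids):
    """Fit each region to its fixed width and concatenate.

    Right-padding with PAD_ID is an assumption made for illustration.
    """
    def fit(ids, width):
        ids = list(ids)[:width]
        return ids + [PAD_ID] * (width - len(ids))

    return fit(summary_ids, PREFIX_LEN) + fit(text_ids, SEQ_LEN - PREFIX_LEN)


def mask_text_region(tokens, mask_prob, rng):
    """Mask text tokens independently with probability mask_prob.

    The summary prefix is never masked. Returns the noisy tokens and a
    per-position flag marking where the masked cross-entropy would apply.
    """
    noisy = list(tokens)
    loss_mask = [False] * len(tokens)
    for i in range(PREFIX_LEN, len(tokens)):
        if rng.random() < mask_prob:
            noisy[i] = MASK_ID
            loss_mask[i] = True
    return noisy, loss_mask


rng = random.Random(0)
tokens = build_sequence(range(100, 140), range(1000, 1500))
noisy, loss_mask = mask_text_region(tokens, mask_prob=0.5, rng=rng)
print(len(noisy), sum(loss_mask), any(loss_mask[:PREFIX_LEN]))
```

In this sketch the loss flags are set only at masked positions inside the text region, which mirrors the "text region only" objective. How the real pipeline treats padding inside the text region is not stated in the card.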
Results / caveats: held-out val NELBO 2.996 (ppl 20.0) vs trash-prefix control 3.293 (ppl 26.9), indicating strong conditioning (samples reproduce ~44% of summary content words, 5.5x the shuffled baseline). NOTE: the hot 100k finetune degraded sampling fluency (gen-PPL ~207 @ 512 steps vs ~59 for the base model); see RESULTS_MDLM_100K.md in the project repo for the full diagnosis (earlier checkpoints sample better; remasking samplers recommended).
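As a quick sanity check on the numbers above, the following sketch assumes the reported perplexities are exp(NELBO) with NELBO in nats per token, which matches the pairs quoted here. The derived quantities (the gap in nats, the implied shuffled-baseline overlap, the fluency ratio) are simple arithmetic on the card's figures, not additional measurements.

```python
import math

# Figures quoted in the card (NELBO assumed to be in nats per token).
nelbo_cond, nelbo_trash = 2.996, 3.293

ppl_cond = math.exp(nelbo_cond)      # perplexity with the real summary prefix
ppl_trash = math.exp(nelbo_trash)    # perplexity with the trash prefix
gap_nats = nelbo_trash - nelbo_cond  # per-token benefit of the real summary

# ~44% content-word overlap at 5.5x the shuffled baseline implies this baseline.
baseline_overlap = 0.44 / 5.5

# Generative perplexity of this checkpoint relative to the base model.
fluency_ratio = 207 / 59

print(f"ppl: {ppl_cond:.1f} vs {ppl_trash:.1f}, gap {gap_nats:.3f} nats/token")
print(f"implied shuffled overlap: {baseline_overlap:.0%}, gen-PPL ratio: {fluency_ratio:.1f}x")
```

Read this way, the real summary is worth roughly 0.3 nats per text token, while sampling fluency is about 3.5x worse than the base model in gen-PPL terms.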
Load (project code): duo_core.load_model("EER6/mdlm-owt-diff1", 1024, 50258, device)
or as --init_ckpt EER6/mdlm-owt-diff1 in train/train_big_mdlm.py.
Companion control: EER6/mdlm-owt-trash.