Interpreter LoRA for verbalizing behaviors encoded in fine-tuning weight deltas. Trained via offline CISPO (MiniMax-M1, arXiv:2506.13585) with Dr. GRPO advantages (arXiv:2503.20783) on K=8 judge-scored rollouts from the CISPO held-out IA LoRAs, i.e. LoRAs excluded from SFT via the DPO-holdout IA split.
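The training objective above (offline CISPO with Dr. GRPO advantages over a group of K judge-scored rollouts) can be sketched as follows. This is a minimal numpy sketch with illustrative function and argument names; in an autograd framework the clipped importance ratio would be detached (the `sg(·)` stop-gradient), which numpy cannot express, so it is noted in comments.

```python
import numpy as np

def cispo_drgrpo_loss(token_logps, token_ratios, scores,
                      eps_low=0.1, eps_high=0.1):
    """Offline CISPO loss with Dr. GRPO advantages for one group of K rollouts.

    token_logps  : list of K arrays, log pi_theta per token of rollout o_i
    token_ratios : list of K arrays, importance ratio rho per token
    scores       : array of K judge scores for the group
    (eps_low/eps_high clip bounds are illustrative defaults)
    """
    # Dr. GRPO advantage: centre scores only -- no std-division, no length-norm
    advantages = scores - scores.mean()
    total_tokens = sum(len(lp) for lp in token_logps)  # sum_i |o_i|
    loss = 0.0
    for lp, rho, A in zip(token_logps, token_ratios, advantages):
        # sg(clip(rho)): in an autograd framework this clipped ratio is
        # detached, so gradients flow only through log pi_theta
        w = np.clip(rho, 1.0 - eps_low, 1.0 + eps_high)
        loss -= np.sum(w * A * lp)
    return loss / total_tokens
```

Because the advantage is score minus the group mean, a group whose rollouts all receive the same judge score contributes zero gradient.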
| Set | pass@N | rollout-mean |
|---|---|---|
| AuditBench (56 orgs) | 73.2% | 47.9% |
| heldout_ia_v2 (20 orgs) | 80.0% | 73.3% |
| ood_models_v3 (23 orgs) | 47.8% | 12.8% |
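For reference, the two metric columns can be computed from per-rollout judge outcomes as sketched below. This assumes `rollouts_by_org` maps each org to its list of N binary judge outcomes; the data layout is an assumption, not the actual eval harness.

```python
def pass_at_n(rollouts_by_org):
    # An org counts as passed if any of its N rollouts is judged correct
    passed = sum(1 for outcomes in rollouts_by_org.values() if any(outcomes))
    return passed / len(rollouts_by_org)

def rollout_mean(rollouts_by_org):
    # Mean judge score over every individual rollout, pooled across orgs
    all_outcomes = [o for outcomes in rollouts_by_org.values() for o in outcomes]
    return sum(all_outcomes) / len(all_outcomes)
```

The gap between the two columns (e.g. 47.8% vs 12.8% on ood_models_v3) reflects how often at least one of the N rollouts succeeds even when the average rollout fails.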
**Objective.** Token-level CISPO loss with Dr. GRPO advantages:

$$\mathcal{L}(\theta) = -\frac{1}{\sum_i |o_i|} \sum_i \sum_t \operatorname{sg}\!\big(\operatorname{clip}(\rho_{i,t})\big)\, A_i \log \pi_\theta(o_{i,t}), \qquad A_i = \text{score}_i - \operatorname{mean}(\text{score})$$

(no std-division, no length-norm).

**Inference.** Feed direction tokens (shape [4480, 5120], `svd_fixed_k16_mag7_rankfirst` format, bf16) through the residual AOEncoder, inject at the layer-1 output at placeholder positions, apply this interpreter over frozen Qwen/Qwen3-14B, and decode greedily.
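The injection step can be sketched as below. This is a minimal numpy sketch: the function name is illustrative, the hidden size 5120 matches Qwen3-14B, and the sketch assumes injection *replaces* the layer-1 hidden states at placeholder positions (rather than adding to them), which is an assumption about the pipeline.

```python
import numpy as np

D_MODEL = 5120  # Qwen3-14B hidden size, matching the direction-token width

def inject_at_placeholders(layer1_out, placeholder_mask, soft_tokens):
    """Overwrite layer-1 hidden states at placeholder token positions.

    layer1_out       : [seq_len, 5120] hidden states after layer 1
    placeholder_mask : [seq_len] bool, True at placeholder positions
    soft_tokens      : [n_placeholders, 5120] AOEncoder outputs (bf16 upcast)
    """
    out = layer1_out.copy()
    out[placeholder_mask] = soft_tokens  # row-wise replacement via boolean mask
    return out
```

In a real run this would be applied through a forward hook on the first decoder layer of the frozen base model, before the remaining layers (and the interpreter LoRA) process the sequence.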