arxiv:2605.17314

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

Published on May 17
Authors:

Abstract

Injecting mathematically wrong, problem-mismatched drafts from a smaller, more domain-trained model into a stronger learner's context during Dr. GRPO training improves mathematical reasoning over standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. The recipe reaches 71.98% on MATH-500 with Mathstral-7B-v0.1, the highest published result on this model to the authors' knowledge.

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields +1.62pp on MATH-500 (greedy pass@1) over the matched-wrong variant (n=10 seeds, p=0.0015, Welch's t). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@k. On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@k above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from k=1 to k=1024 across 2 seeds (+14.2pp on 2025 and +9.0pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe -- trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop -- reaches 71.98% MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at 70.9% on full MATH (SFT + PPO with process/instruction reward models).
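The abstract's two most mechanical ingredients, shuffling drafts so that each one lands on a different problem and scoring with pass@k, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mismatch_drafts` and `pass_at_k` are hypothetical, the derangement-by-cyclic-shift strategy is only one possible way to guarantee mismatch, and the pass@k formula is the commonly used unbiased estimator, which the abstract does not confirm as the one used in the paper.

```python
import math
import random


def mismatch_drafts(drafts, seed=0):
    """Reassign drafts so that no problem keeps its own draft.

    Assumption: drafts[i] was generated for problem i. We draw a random
    ordering of the indices and give each problem the draft of the next
    problem in that ordering, which guarantees every index moves.
    """
    n = len(drafts)
    if n < 2:
        raise ValueError("need at least two drafts to mismatch")
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    shuffled = [None] * n
    for i in range(n):
        # Problem order[i] receives the draft written for order[i + 1].
        shuffled[order[i]] = drafts[order[(i + 1) % n]]
    return shuffled


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    drafts = ["draft for p0", "draft for p1", "draft for p2", "draft for p3"]
    mismatched = mismatch_drafts(drafts, seed=1)
    # No problem is paired with the draft originally written for it.
    print(all(a != b for a, b in zip(mismatched, drafts)))
    print(pass_at_k(n=10, c=3, k=1))
```

The cyclic shift keeps the set of drafts unchanged and only changes the pairing, which matches the abstract's description of "shuffling drafts to mismatched problems while holding everything else constant". How the drafts are then placed into the learner's GRPO context is not specified in the abstract and is left out of the sketch.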


