Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL
Abstract
Reinforcement fine-tuning with a clarification-aware verifiable reward enables calibrated abstention and post-refusal clarification on unanswerable queries while preserving reasoning performance on answerable ones.
Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain but also explain what is missing. We propose a clarification-aware reward for reinforcement learning with verifiable rewards (RLVR) that rewards correct answers on answerable queries while jointly optimizing explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B model that improves abstention and clarification on unanswerable queries while preserving strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 substantially improves over its base model and achieves unanswerable-query behavior competitive with larger systems, including DeepSeek-R1, suggesting that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.
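As a rough illustration of how such a clarification-aware reward might be scored per rollout, here is a minimal sketch. It assumes a binary answerability label, a gold statement of the missing information, and an external semantic-similarity function; the function names, threshold, and partial-credit value are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a clarification-aware RLVR reward (not the paper's code).
# `semantic_sim` is any callable mapping two strings to a similarity in [0, 1],
# e.g. an embedding-based scorer supplied by the training pipeline.

def clarification_aware_reward(
    query_is_answerable: bool,
    response: dict,            # {"abstained": bool, "answer": str, "clarification": str}
    gold_answer: str,
    gold_missing_info: str,
    semantic_sim,
    sim_threshold: float = 0.7,   # assumed alignment threshold
) -> float:
    """Return a scalar reward for a single rollout."""
    if query_is_answerable:
        # Answerable: reward correct answers, give no credit for unnecessary abstention.
        if response["abstained"]:
            return 0.0
        return 1.0 if response["answer"].strip() == gold_answer.strip() else 0.0

    # Unanswerable: require an explicit abstention...
    if not response["abstained"]:
        return 0.0
    # ...and a post-refusal clarification that identifies the key missing information.
    if semantic_sim(response["clarification"], gold_missing_info) >= sim_threshold:
        return 1.0   # abstained and pinpointed what is missing
    return 0.5       # abstained but missed the key gap (assumed partial credit)
```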
Community
RL fine-tuning makes LLMs better reasoners — but also bolder hallucinators on questions they can't actually answer. In this work, we argue a reliable model should abstain and pinpoint the missing information. We propose a clarification-aware RLVR reward that verifies whether post-refusal clarifications actually identify the key missing piece, and use it to train Abstain-R1 (3B). The model improves abstention + clarification quality while preserving performance on answerable queries. Model and benchmark released 🤗
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Beyond "I Don't Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty (2026)
- Less Approximates More: Harmonizing Performance and Confidence Faithfulness via Hybrid Post-Training for High-Stakes Tasks (2026)
- Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents (2026)
- Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation (2026)
- Pause or Fabricate? Training Language Models for Grounded Reasoning (2026)
- MediX-R1: Open Ended Medical Reinforcement Learning (2026)
- Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness (2026)