Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG
Abstract
Retrieval-augmented generation systems exhibit source-dependent responses to identical queries, necessitating a shift from traditional correctness evaluation to analyzing inter-source relationships for multi-source NLP systems.
A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves -- a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested -- understating its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.
Community
v1
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora (2026)
- Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study (2026)
- DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering (2026)
- ASTRA-QA: A Benchmark for Abstract Question Answering over Documents (2026)
- BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection (2026)
- Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering (2026)
- Same Ranking, Different Winner: How Scoring Targets Shape LLM Memory Benchmarks (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2605.29084 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper