On the Limits of LLM-as-Judge for Scientific Novelty Assessment
Abstract
LLM judges and human experts disagree on the novelty of research questions generated by large language models, which raises concerns about relying on LLMs for scientific novelty evaluation.
LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream object: the research question (RQ). RQ generation is a prerequisite for scientific ideation, and RQs can be compared against questions pursued in real papers. We introduce RQ-Bench, a benchmark built from recent arXiv papers. For each paper, we reconstruct author-anchored RQs from its cited background, gaps, and contributions. These RQs are not the only valid questions for the same background. They are author-anchored reference points for testing novelty judgments. We evaluate model-generated RQs with standalone LLM judging, comparative LLM judging, and human expert evaluation. LLM judges consistently rate model-generated RQs as highly novel, producing a novelty mirage; in comparative evaluations, this preference becomes even stronger. Domain experts, however, reach the opposite conclusion and prefer the author-anchored reference questions. We further find that many generated RQs are narrow or source-bound, a dimension that LLM judges often miss unless explicitly tested. Overall, the contradictory novelty evaluations between LLM judges and human experts raise a serious concern about the reliability of using LLMs to assess the scientific novelty of research questions.
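The disagreement described in the abstract can be made concrete with a small sketch. It assumes each comparative judgment is recorded as a label for the preferred question: either the model-generated RQ or the author-anchored reference RQ. The verdict lists, the labels, and the function names `win_rate` and `agreement_rate` are hypothetical illustrations, and the numbers are toy values rather than results from the paper.

```python
from collections import Counter

# Toy, made-up pairwise verdicts (NOT data from the paper).
# Each entry says which question a judge preferred for one item:
# "generated" = model-generated RQ, "reference" = author-anchored RQ.
llm_verdicts = ["generated", "generated", "generated", "reference", "generated"]
human_verdicts = ["reference", "reference", "generated", "reference", "reference"]


def win_rate(verdicts, label="generated"):
    """Fraction of comparisons in which `label` was preferred."""
    return Counter(verdicts)[label] / len(verdicts)


def agreement_rate(a, b):
    """Fraction of items on which two judges gave the same verdict."""
    if len(a) != len(b):
        raise ValueError("verdict lists must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


llm_rate = win_rate(llm_verdicts)        # how often the LLM judge picks the generated RQ
human_rate = win_rate(human_verdicts)    # how often experts pick the generated RQ
agreement = agreement_rate(llm_verdicts, human_verdicts)

print(f"LLM judge prefers generated RQ: {llm_rate:.0%}")
print(f"Experts prefer generated RQ:    {human_rate:.0%}")
print(f"Item-level agreement:           {agreement:.0%}")
```

A large gap between the two win rates, together with low item-level agreement, is the kind of pattern the abstract calls contradictory. A real analysis would likely also use a chance-corrected agreement statistic and many more items.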
Community
Are LLM judges reliable for scientific novelty assessment?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement (2026)
- Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation (2026)
- PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers (2026)
- An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics (2026)
- Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering (2026)
- Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents? (2026)
- From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs (2026)