arxiv:2606.12071

On the Limits of LLM-as-Judge for Scientific Novelty Assessment

Published on Jun 10

· Submitted by

Soujanya Poria on Jun 10

Deep Cognition and Language Research (DeCLaRe) Lab

Upvote

Authors:

Abstract

Research questions generated by large language models exhibit inconsistent novelty assessments when compared to human experts, highlighting concerns about relying on LLMs for scientific novelty evaluation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream object: the research question (RQ). RQ generation is a prerequisite for scientific ideation, and RQs can be compared against questions pursued in real papers. We introduce RQ-Bench, a benchmark built from recent arXiv papers. For each paper, we reconstruct author-anchored RQs from its cited background, gaps, and contributions. These RQs are not the only valid questions for the same background. They are author-anchored reference points for testing novelty judgments. We evaluate model-generated RQs with standalone LLM judging, comparative LLM judging, and human expert evaluation. LLM judges consistently rate model-generated RQs as highly novel, producing a novelty mirage; in comparative evaluations, this preference becomes even stronger. Domain experts, however, reach the opposite conclusion and prefer the author-anchored reference questions. We further find that many generated RQs are narrow or source-bound, a dimension that LLM judges often miss unless explicitly tested. Overall, the contradictory novelty evaluations between LLM judges and human experts raise a serious concern about the reliability of using LLMs to assess the scientific novelty of research questions.