Evaluation tool for web research LLMs: accuracy + hallucination + cost

#1
by vigneshwar234 - opened

Hi OSU NLP team 👋

QUEST's multi-source web research approach is impressive. For teams deploying web research LLMs, hallucination is the most critical failure mode: fabricated citations are worse than no answer.

I built an open-source LLM Evaluation Framework with a dedicated hallucination metric:

→ 🔍 Hallucination Rate: detects ungrounded claims, runs locally on any output
→ 🎯 Accuracy: verified against ground truth
→ 🧠 Reasoning Quality: CoT depth, important for research-style multi-step answers
→ 💰 Cost per 1K tokens: web research tasks are token-heavy
→ ⚡ Latency p95
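
To make the metrics above concrete, here is a minimal sketch of how four of them could be aggregated from per-response records. It is not the framework's actual code: the `RunRecord` schema, the `summarize` and `p95` names, and the assumption that claims have already been labeled as grounded or ungrounded are all hypothetical. Reasoning Quality is left out because it needs a judgment step that a few lines cannot capture.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One evaluated response (hypothetical schema, not the framework's own)."""
    correct: bool           # answer matched the ground truth
    total_claims: int       # claims extracted from the answer
    ungrounded_claims: int  # claims with no supporting source
    tokens: int             # prompt + completion tokens
    cost_usd: float         # billed cost for this call
    latency_s: float        # wall-clock latency in seconds


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(records):
    """Aggregate accuracy, hallucination rate, cost, and p95 latency."""
    if not records:
        raise ValueError("need at least one record")
    total_claims = sum(r.total_claims for r in records)
    total_tokens = sum(r.tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        # Share of responses that matched the ground truth.
        "accuracy": sum(r.correct for r in records) / len(records),
        # Share of all extracted claims that had no supporting source.
        "hallucination_rate": (
            sum(r.ungrounded_claims for r in records) / total_claims
            if total_claims else 0.0
        ),
        # Total spend normalized to 1,000 tokens.
        "cost_per_1k_tokens": (
            1000 * total_cost / total_tokens if total_tokens else 0.0
        ),
        "latency_p95_s": p95([r.latency_s for r in records]),
    }
```

Calling `summarize` on a list of `RunRecord` objects returns a plain dict, so the numbers are easy to log or compare across models. The hallucination rate here is computed per claim rather than per response, which is one possible design choice: a long research answer with a single fabricated citation is penalized less than a short answer that is entirely ungrounded.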

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Open source. Free forever. Happy to discuss web research LLM evaluation approaches!
