Evaluation tool for web research LLMs: accuracy + hallucination + cost
Hi OSU NLP team,
QUEST's multi-source web research approach is impressive. For teams deploying web research LLMs, hallucination is the most critical failure mode: fabricated citations are worse than no answer.
I built an open-source LLM Evaluation Framework with a dedicated hallucination metric. It reports:
- Hallucination Rate: detects ungrounded claims, runs locally on any output
- Accuracy: verified against ground truth
- Reasoning Quality: CoT depth, important for research-style multi-step answers
- Cost per 1K tokens: web research tasks are token-heavy
- Latency p95
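To make the numeric metrics concrete, here is a minimal sketch of how they could be aggregated over a batch of evaluated answers. It is an illustration only and not the framework's actual API: the `Record` fields, `p95`, and `summarize` are hypothetical names, hallucination rate is assumed to be the share of ungrounded claims, and p95 uses the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    """One evaluated answer (hypothetical schema, for illustration only)."""
    claims_total: int       # claims extracted from the answer
    claims_ungrounded: int  # claims with no supporting source
    correct: bool           # matches the ground-truth answer
    tokens: int             # prompt + completion tokens
    cost_usd: float         # total cost of the call
    latency_s: float        # wall-clock latency in seconds


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(records):
    """Aggregate per-answer records into batch-level metrics."""
    if not records:
        raise ValueError("need at least one record")
    total_claims = sum(r.claims_total for r in records)
    total_tokens = sum(r.tokens for r in records)
    ungrounded = sum(r.claims_ungrounded for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "hallucination_rate": ungrounded / total_claims if total_claims else 0.0,
        "accuracy": sum(r.correct for r in records) / len(records),
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens if total_tokens else 0.0,
        "latency_p95_s": p95([r.latency_s for r in records]),
    }


if __name__ == "__main__":
    batch = [
        Record(10, 2, True, 2000, 0.04, 1.0),
        Record(10, 0, False, 2000, 0.04, 3.0),
    ]
    print(summarize(batch))
```

Reasoning quality is left out because it needs a judgment model or rubric rather than simple arithmetic.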
Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
Open source. Free forever. Happy to discuss web research LLM evaluation approaches!