GIFT-Eval

Complementary LLM evaluation for models used in time series + forecasting tasks

#21

by vigneshwar234 - opened 23 days ago

Hi Salesforce Research team 👋

GIFT-Eval's time series forecasting benchmark is comprehensive. For LLMs used as components of forecasting pipelines (prompt-based forecasting, explanation generation), I built a complementary text evaluation framework.

The LLM Evaluation Framework covers the text side of these pipelines with five metrics:

→ 🧠 Reasoning Quality: when an LLM explains its forecast reasoning, chain-of-thought (CoT) depth matters
→ 🔍 Hallucination Rate: fabricated trend explanations undermine trust in forecasts
→ 💰 Cost per 1K tokens: LLMs in forecasting pipelines run frequently, so cost adds up
→ ⚡ Latency p95: real-time forecasting applications need bounded latency
→ 🎯 Accuracy: text-task accuracy on structured benchmarks
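
To make the metrics concrete, here is a rough sketch of how four of them could be aggregated from logged LLM calls. It is not the framework's actual API: `CallRecord`, its fields, `p95`, and `summarize` are hypothetical names chosen for illustration, and p95 uses the simple nearest-rank definition. Reasoning Quality is left out because it needs a judge or rubric rather than a plain aggregate.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One logged LLM call in a forecasting pipeline (hypothetical schema)."""
    latency_s: float     # wall-clock latency in seconds
    tokens: int          # prompt + completion tokens
    cost_usd: float      # cost attributed to this call
    correct: bool        # output matched the reference answer
    hallucinated: bool   # explanation contained an unsupported claim


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    # ceil(0.95 * n) computed with integers to avoid float rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(records):
    """Aggregate per-call logs into the four numeric metrics."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    total_tokens = sum(r.tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "accuracy": sum(r.correct for r in records) / n,
        "hallucination_rate": sum(r.hallucinated for r in records) / n,
        "cost_per_1k_tokens": 1000 * total_cost / total_tokens,
        "latency_p95_s": p95([r.latency_s for r in records]),
    }


if __name__ == "__main__":
    # Made-up records, only to show the call shape
    demo = [
        CallRecord(0.8, 420, 0.0009, True, False),
        CallRecord(1.4, 610, 0.0013, False, True),
        CallRecord(0.6, 380, 0.0008, True, False),
    ]
    print(summarize(demo))
```

Pooling cost over total tokens (rather than averaging per-call rates) weights long calls proportionally, which seems closer to what a pipeline is actually billed.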

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Would love to discuss LLM evaluation in the context of time series forecasting pipelines!
