Complementary LLM evaluation for models used in time series + forecasting tasks

#21
by vigneshwar234 - opened

Hi Salesforce Research team 👋

GIFT-Eval's time series forecasting benchmark is comprehensive. For LLMs being evaluated as components in forecasting pipelines (prompt-based forecasting, explanation generation), I built a complementary text evaluation framework.

The LLM Evaluation Framework covers the text side of the LLM with five metrics:

→ 🧠 Reasoning Quality: for LLMs explaining forecast reasoning, CoT depth matters
→ 🔍 Hallucination Rate: fabricated trend explanations undermine trust in forecasts
→ 💰 Cost per 1K tokens: LLMs in forecasting pipelines run frequently, so cost adds up
→ ⚡ Latency p95: real-time forecasting applications need bounded latency
→ 🎯 Accuracy: text task accuracy on structured benchmarks
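
To make two of these concrete, here is a rough sketch of how latency p95 and cost per 1K tokens could be computed from logged requests. This is not the framework's actual code: the function names and the sample numbers are hypothetical, and the percentile uses the simple nearest-rank definition, which may differ from what the framework does.

```python
def p95_latency(latencies_ms):
    """Nearest-rank 95th percentile of per-request latencies (hypothetical helper)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # 1-based nearest rank: ceil(0.95 * n), done in integer arithmetic
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_1k_tokens(total_cost_usd, total_tokens):
    """Average spend per 1,000 tokens across a batch of requests (hypothetical helper)."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd / total_tokens * 1000


# Made-up sample data, for illustration only
sample_latencies = list(range(1, 101))  # 1 ms .. 100 ms
print(p95_latency(sample_latencies))
print(cost_per_1k_tokens(0.5, 250_000))
```

Nearest-rank always returns an observed latency rather than an interpolated one, which is easy to reason about with small samples.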

Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework

Would love to discuss LLM evaluation in the context of time series forecasting pipelines!
