Complementary LLM evaluation for models used in time series forecasting tasks
Hi Salesforce Research team,
GIFT-Eval is a comprehensive benchmark for time series forecasting. For LLMs used as components in forecasting pipelines (prompt-based forecasting, explanation generation), I built a complementary framework that evaluates the text side.
The LLM Evaluation Framework covers the text side of these pipelines with five metrics:
- Reasoning Quality: when an LLM explains the reasoning behind a forecast, chain-of-thought depth matters
- Hallucination Rate: fabricated trend explanations undermine trust in forecasts
- Cost per 1K tokens: LLMs in forecasting pipelines run frequently, so costs add up
- Latency p95: real-time forecasting applications need bounded latency
- Accuracy: text-task accuracy on structured benchmarks
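
To make the metric definitions concrete, here is a rough sketch of how three of them could be computed from logged LLM calls. This is only an illustration under stated assumptions, not the framework's actual code: the `CallRecord` shape and all function names are hypothetical, p95 uses the nearest-rank definition, and the hallucination label is assumed to come from some external checker or human review.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One logged LLM call (hypothetical record shape, not the framework's)."""
    latency_s: float    # wall-clock seconds for the call
    tokens: int         # prompt + completion tokens
    cost_usd: float     # amount billed for the call
    hallucinated: bool  # label assumed to come from a checker or reviewer


def latency_p95(records):
    """95th-percentile latency using the nearest-rank method."""
    ordered = sorted(r.latency_s for r in records)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def cost_per_1k_tokens(records):
    """Total spend divided by total tokens, scaled to 1,000 tokens."""
    total_tokens = sum(r.tokens for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return 1000 * total_cost / total_tokens


def hallucination_rate(records):
    """Fraction of calls whose output was labeled as hallucinated."""
    return sum(r.hallucinated for r in records) / len(records)


if __name__ == "__main__":
    # Synthetic records purely for demonstration; not real measurements.
    demo = [
        CallRecord(latency_s=float(i), tokens=500, cost_usd=0.001,
                   hallucinated=(i % 5 == 0))
        for i in range(1, 21)
    ]
    print("p95 latency (s):", latency_p95(demo))
    print("cost per 1K tokens (USD):", cost_per_1k_tokens(demo))
    print("hallucination rate:", hallucination_rate(demo))
```

Nearest-rank is used here because it always returns an observed latency; an interpolating percentile would be an equally reasonable choice.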
Live demo: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
Would love to discuss LLM evaluation in the context of time series forecasting pipelines!