Evaluate AI model predictions with correctness scores
Track, rank and evaluate open LLMs and chatbots