Comprehensive Evaluation Results

#1
by GODELEV - opened

TO: ThingAI Team
DATE: June 3, 2026
SUBJECT: Comprehensive Evaluation Results & Model Card Update for ThingAI/Quark-50m

I wanted to reach out and congratulate you on the development of ThingAI/Quark-50m. Designing an ultra-compact, 50-million-parameter model requires careful optimization, and it is fascinating to see how well it preserves core reasoning and linguistic patterns at this scale.

I have run an evaluation suite covering standard language, reasoning, and academic knowledge benchmarks. The results below summarize the model's performance by category.

Benchmark Evaluation Metrics

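| Category | Benchmark | Metric | Score / Value | Status |
| --- | --- | --- | --- | --- |
| Linguistics & Grammar | BLiMP | Accuracy | 68.12% | Success |
| Commonsense & Reasoning | PIQA | Normalized Accuracy | 57.83% | Success |
| Commonsense & Reasoning | COPA | Accuracy | 57.00% | Success |
| Commonsense & Reasoning | BoolQ | Accuracy | 52.17% | Success |
| Commonsense & Reasoning | WinoGrande | Accuracy | 47.36% | Success |
| Commonsense & Reasoning | HellaSwag | Normalized Accuracy | 28.49% | Success |
| Commonsense & Reasoning | RACE | Accuracy | 26.41% | Success |
| Commonsense & Reasoning | CommonsenseQA | Accuracy | 20.31% | Success |
| Academic & Knowledge | SciQ | Normalized Accuracy | 49.00% | Success |
| Academic & Knowledge | ARC-Easy | Normalized Accuracy | 36.49% | Success |
| Academic & Knowledge | MMLU | Accuracy | 25.64% | Success |
| Academic & Knowledge | ARC-Challenge | Normalized Accuracy | 25.17% | Success |
| Academic & Knowledge | OpenBookQA | Normalized Accuracy | 25.40% | Success |
| Language Modeling | LAMBADA | Accuracy | 15.87% | Success |
| Language Modeling | WikiText-2 | Word Perplexity | 251.76 | Success |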

Notes on Failed Tasks: Two tasks did not run. The Arithmetic benchmark failed due to outdated script support (arithmetic.py), and SocialIQA failed due to a registration tag error (siqa). The other 15 tasks, listed in the table above, completed successfully.


Key Takeaways & Recommendation

For an extremely lean 50M-parameter footprint, Quark-50m demonstrates stable linguistic grounding, highlighted by its 68.12% on BLiMP and 57.83% on PIQA. It is a good candidate for highly constrained edge deployments and tiny-LLM research.
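To put these scores in context, one option is to compare each accuracy with the score expected from uniform random guessing. The sketch below does this for the multiple-choice tasks in the table. The number of answer choices per task is an assumption based on the usual format of each benchmark, not something reported by this evaluation run, and `margin_over_chance` is an illustrative helper name. LAMBADA and WikiText-2 are left out because they are not multiple-choice tasks.

```python
# Scores copied from the benchmark table (accuracy or normalized accuracy, in %).
SCORES = {
    "BLiMP": 68.12, "PIQA": 57.83, "COPA": 57.00, "BoolQ": 52.17,
    "WinoGrande": 47.36, "HellaSwag": 28.49, "RACE": 26.41,
    "CommonsenseQA": 20.31, "SciQ": 49.00, "ARC-Easy": 36.49,
    "MMLU": 25.64, "ARC-Challenge": 25.17, "OpenBookQA": 25.40,
}

# ASSUMPTION: typical number of answer choices per item for each benchmark.
# These values are not part of the reported results; adjust them if the
# evaluation setup used a different task format.
NUM_CHOICES = {
    "BLiMP": 2, "PIQA": 2, "COPA": 2, "BoolQ": 2, "WinoGrande": 2,
    "HellaSwag": 4, "RACE": 4, "CommonsenseQA": 5, "SciQ": 4,
    "ARC-Easy": 4, "MMLU": 4, "ARC-Challenge": 4, "OpenBookQA": 4,
}


def margin_over_chance(scores, num_choices):
    """Return score minus the uniform-guessing baseline, in percentage points."""
    margins = {}
    for task, score in scores.items():
        chance = 100.0 / num_choices[task]
        margins[task] = round(score - chance, 2)
    return margins


if __name__ == "__main__":
    margins = margin_over_chance(SCORES, NUM_CHOICES)
    # Print tasks from the largest margin over chance to the smallest.
    for task, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{task:<15}{margin:+7.2f}")
```

Under these assumptions, a margin near zero suggests the model is close to guessing on that task, while a clearly positive margin indicates a real signal. The baseline here is uniform guessing only; it does not account for class imbalance in any dataset.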

To help researchers assess its capabilities relative to its size, I recommend adding this benchmark table to your official Hugging Face model card. Publishing these metrics will help the open-source community track the efficiency gains of compact architectures. You can also list your model on "AxiomicLabs/Open_SLM_Leaderboard".
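If it helps with the model card, here is a minimal sketch of rendering results as a Markdown table that can be pasted into a README. `to_markdown_table` is a hypothetical helper written for this example, not part of any Hugging Face library, and only a few rows from the table are included to keep it short.

```python
def to_markdown_table(headers, rows):
    """Render a header list and a list of rows as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    headers = ["Category", "Benchmark", "Metric", "Score / Value"]
    # A subset of the reported results, copied from the benchmark table.
    rows = [
        ["Linguistics & Grammar", "BLiMP", "Accuracy", "68.12%"],
        ["Commonsense & Reasoning", "PIQA", "Normalized Accuracy", "57.83%"],
        ["Language Modeling", "WikiText-2", "Word Perplexity", "251.76"],
    ]
    print(to_markdown_table(headers, rows))
```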

Kudos again on an excellent model release! Let me know if you would like any additional details regarding the evaluation environment setup.

Best regards,
Akshit

Hi Akshit,
Thank you so much for taking the time to run these comprehensive benchmarks on Quark-50m! I really appreciate the detailed breakdown and the feedback.
For a 50M parameter model, I'm quite happy to see such stable linguistic grounding on BLiMP and PIQA. It’s great to have these objective metrics to see exactly where it holds up and where the scaling limits hit.
I will definitely follow your recommendation and integrate this evaluation table into the Hugging Face model card so the community can have full transparency on its performance. I’ll also look into listing it on the AxiomicLabs leaderboard as you suggested!
Thanks again for the support and for contributing to open-source tiny-LLM research.
Best regards,
Michelangelo Di Nicola (ThingAI Team)
