Comprehensive Evaluation Results
TO: ThingAI Team
DATE: June 3, 2026
SUBJECT: Comprehensive Evaluation Results & Model Card Update for ThingAI/Quark-50m
I wanted to reach out and congratulate you on the development of ThingAI/Quark-50m. Designing an ultra-compact, 50-million-parameter model requires incredible optimization, and it is fascinating to see how well it preserves core reasoning and linguistic patterns at this scale.
I have run a thorough evaluation across multiple standard language, reasoning, and academic knowledge benchmarks. The results below outline its performance and offer a clean, structured breakdown of its foundational capabilities.
Benchmark Evaluation Metrics
| Category | Benchmark | Metric | Score / Value | Status |
|---|---|---|---|---|
| Linguistics & Grammar | BLiMP | Accuracy | 68.12% | Success |
| Commonsense & Reasoning | PIQA | Normalized Accuracy | 57.83% | Success |
| Commonsense & Reasoning | COPA | Accuracy | 57.00% | Success |
| Commonsense & Reasoning | BoolQ | Accuracy | 52.17% | Success |
| Commonsense & Reasoning | WinoGrande | Accuracy | 47.36% | Success |
| Commonsense & Reasoning | HellaSwag | Normalized Accuracy | 28.49% | Success |
| Commonsense & Reasoning | RACE | Accuracy | 26.41% | Success |
| Commonsense & Reasoning | CommonsenseQA | Accuracy | 20.31% | Success |
| Academic & Knowledge | SciQ | Normalized Accuracy | 49.00% | Success |
| Academic & Knowledge | ARC-Easy | Normalized Accuracy | 36.49% | Success |
| Academic & Knowledge | MMLU | Accuracy | 25.64% | Success |
| Academic & Knowledge | ARC-Challenge | Normalized Accuracy | 25.17% | Success |
| Academic & Knowledge | OpenBookQA | Normalized Accuracy | 25.40% | Success |
| Language Modeling | LAMBADA | Accuracy | 15.87% | Success |
| Language Modeling | WikiText-2 | Word Perplexity | 251.76 | Success |
Notes on Failed Tasks: The Arithmetic benchmark failed due to outdated script support (arithmetic.py), and SocialIQA failed due to a registration tag error (siqa). The baseline run completed successfully for the other 15 tasks.
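One way to read the table is against the score a random guesser would get on each multiple-choice task. The sketch below is only an illustration and was not part of the evaluation run: the scores are copied from the table, while the chance levels are assumptions based on the nominal number of answer options per task (for example, two for PIQA and four for MMLU), and the name `margin_over_chance` is hypothetical. LAMBADA and WikiText-2 are left out because they have no fixed set of answer options.

```python
# Scores (percent) copied from the benchmark table.
SCORES = {
    "BLiMP": 68.12, "PIQA": 57.83, "COPA": 57.00, "BoolQ": 52.17,
    "WinoGrande": 47.36, "HellaSwag": 28.49, "RACE": 26.41,
    "CommonsenseQA": 20.31, "SciQ": 49.00, "ARC-Easy": 36.49,
    "MMLU": 25.64, "ARC-Challenge": 25.17, "OpenBookQA": 25.40,
}

# ASSUMPTION: nominal number of answer options per task, used to derive a
# uniform random-guess baseline. Not taken from the evaluation logs.
NUM_OPTIONS = {
    "BLiMP": 2, "PIQA": 2, "COPA": 2, "BoolQ": 2, "WinoGrande": 2,
    "HellaSwag": 4, "RACE": 4, "CommonsenseQA": 5, "SciQ": 4,
    "ARC-Easy": 4, "MMLU": 4, "ARC-Challenge": 4, "OpenBookQA": 4,
}


def margin_over_chance(scores, num_options):
    """Return {task: score - chance} in percentage points, rounded to 2 places."""
    return {
        task: round(score - 100.0 / num_options[task], 2)
        for task, score in scores.items()
    }


if __name__ == "__main__":
    margins = margin_over_chance(SCORES, NUM_OPTIONS)
    # Print tasks from largest to smallest margin over the assumed baseline.
    for task, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{task:<14} {margin:+6.2f} pp vs. chance")
```

Under these assumptions, SciQ, BLiMP, and ARC-Easy sit furthest above chance, while MMLU, ARC-Challenge, OpenBookQA, and CommonsenseQA land within about one percentage point of it.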
Key Takeaways & Recommendation
For an extremely lean 50M-parameter footprint, Quark-50m demonstrates stable linguistic grounding, highlighted by its 68.12% accuracy on BLiMP and 57.83% normalized accuracy on PIQA. It stands as an excellent candidate for highly constrained edge deployments and tiny-LLM research.
To help researchers fully appreciate its capabilities relative to its size, I highly recommend integrating this clean benchmark table into your official Hugging Face model card. Providing these transparent metrics will greatly assist the open-source community in tracking the efficiency gains of compact architectures. You can also list the model on "AxiomicLabs/Open_SLM_Leaderboard".
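For the model card, a table like the one above could be generated from structured results rather than typed by hand, which helps keep every row at the same column count. The following is a minimal sketch using only the Python standard library; the function name `render_markdown_table` and the record layout are hypothetical, and only three rows from the evaluation are shown for brevity.

```python
HEADER = ("Category", "Benchmark", "Metric", "Score / Value")

# A few rows copied from the evaluation table; extend with the remaining tasks.
ROWS = [
    ("Linguistics & Grammar", "BLiMP", "Accuracy", "68.12%"),
    ("Commonsense & Reasoning", "PIQA", "Normalized Accuracy", "57.83%"),
    ("Language Modeling", "WikiText-2", "Word Perplexity", "251.76"),
]


def render_markdown_table(header, rows):
    """Render a pipe-delimited Markdown table, one line per row."""
    # Reject rows whose cell count does not match the header, so a missing
    # category cell is caught before the table is published.
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} cells, got {len(row)}: {row}")
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "---|" * len(header),
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_markdown_table(HEADER, ROWS))
```

The resulting string can be pasted into the `README.md` of the model repository, which Hugging Face renders as the model card.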
Kudos again on an excellent model release! Let me know if you would like any additional details regarding the evaluation environment setup.
Best regards,
Akshit
Hi Akshit,
Thank you so much for taking the time to run these comprehensive benchmarks on Quark-50m! I really appreciate the detailed breakdown and the feedback.
For a 50M parameter model, I'm quite happy to see such stable linguistic grounding on BLiMP and PIQA. It’s great to have these objective metrics to see exactly where it holds up and where the scaling limits hit.
I will definitely follow your recommendation and integrate this evaluation table into the Hugging Face model card so the community can have full transparency on its performance. I’ll also look into listing it on the AxiomicLabs leaderboard as you suggested!
Thanks again for the support and for contributing to the open-source tiny-LLM research.
Best regards,
Michelangelo Di Nicola (ThingAI Team)