ThingAI/Quark-50m · Comprehensive Evaluation Results

by GODELEV - opened 1 day ago

TO: ThingAI Team
DATE: June 3, 2026
SUBJECT: Comprehensive Evaluation Results & Model Card Update for ThingAI/Quark-50m

I wanted to reach out and congratulate you on the development of ThingAI/Quark-50m. Designing an ultra-compact, 50-million-parameter model takes careful optimization, and it is fascinating to see how well it preserves core reasoning and linguistic patterns at this scale.

I have run an evaluation suite covering standard language, reasoning, and academic-knowledge benchmarks. The results below summarize the model's performance, grouped by category.

Benchmark Evaluation Metrics

Category	Benchmark	Metric	Score / Value	Status
Linguistics & Grammar	BLiMP	Accuracy	68.12%	Success
Commonsense & Reasoning	PIQA	Normalized Accuracy	57.83%	Success
	COPA	Accuracy	57.00%	Success
	BoolQ	Accuracy	52.17%	Success
	WinoGrande	Accuracy	47.36%	Success
	HellaSwag	Normalized Accuracy	28.49%	Success
	RACE	Accuracy	26.41%	Success
	CommonsenseQA	Accuracy	20.31%	Success
Academic & Knowledge	SciQ	Normalized Accuracy	49.00%	Success
	ARC-Easy	Normalized Accuracy	36.49%	Success
	MMLU	Accuracy	25.64%	Success
	ARC-Challenge	Normalized Accuracy	25.17%	Success
	OpenBookQA	Normalized Accuracy	25.40%	Success
Language Modeling	LAMBADA	Accuracy	15.87%	Success
	WikiText-2	Word Perplexity	251.76	Success

Notes on Failed Tasks: Two tasks did not run. The Arithmetic benchmark failed because of outdated script support (arithmetic.py), and SocialIQA failed because of a registration tag error (siqa). The other 15 tasks, all listed in the table above, completed successfully.
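For readers who want to put the multiple-choice scores in context, here is a minimal sketch that subtracts a uniform-random-guess baseline from each accuracy in the table. The number of answer options per task is my assumption about the standard task formats, not something reported by the evaluation run (ARC items mostly, but not always, have four options). LAMBADA and WikiText-2 are left out because they are not multiple-choice tasks.

```python
# Accuracy scores (percent), copied from the table above.
SCORES = {
    "BLiMP": 68.12, "PIQA": 57.83, "COPA": 57.00, "BoolQ": 52.17,
    "WinoGrande": 47.36, "HellaSwag": 28.49, "RACE": 26.41,
    "CommonsenseQA": 20.31, "SciQ": 49.00, "ARC-Easy": 36.49,
    "MMLU": 25.64, "ARC-Challenge": 25.17, "OpenBookQA": 25.40,
}

# ASSUMPTION: answer options per item in the usual task format.
# These values are not part of the reported results.
N_CHOICES = {
    "BLiMP": 2, "PIQA": 2, "COPA": 2, "BoolQ": 2, "WinoGrande": 2,
    "HellaSwag": 4, "RACE": 4, "CommonsenseQA": 5, "SciQ": 4,
    "ARC-Easy": 4, "MMLU": 4, "ARC-Challenge": 4, "OpenBookQA": 4,
}


def margin_over_chance(scores, n_choices):
    """Return score minus the uniform-guess baseline, in percentage points."""
    return {
        task: round(score - 100.0 / n_choices[task], 2)
        for task, score in scores.items()
    }


if __name__ == "__main__":
    margins = margin_over_chance(SCORES, N_CHOICES)
    # Print tasks from the largest margin to the smallest.
    for task, margin in sorted(margins.items(), key=lambda kv: -kv[1]):
        print(f"{task:<14} {SCORES[task]:6.2f}%  {margin:+6.2f} pts vs. chance")
```

A margin near zero suggests the model is close to guessing on that task, while a larger positive margin points to signal the model has actually picked up.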

Key Takeaways & Recommendation

For an extremely lean 50M-parameter footprint, Quark-50m demonstrates remarkably stable linguistic grounding, highlighted by its 68.12% on BLiMP and 57.83% on PIQA. It stands as an excellent candidate for highly constrained edge deployments and tiny-LLM research.

To help researchers fully appreciate its capabilities relative to its size, I highly recommend integrating this benchmark table into your official Hugging Face model card. Publishing these metrics will help the open-source community track the efficiency gains of compact architectures. You can also list your model on "AxiomicLabs/Open_SLM_Leaderboard".
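Since a Hugging Face model card is a Markdown README, one way to carry the table over is to render it as a Markdown table. The sketch below is only an illustration: the helper name `to_markdown` and the column layout are my own choices, and the Status column is dropped because every listed task succeeded.

```python
# Rows copied from the table above: (category, benchmark, metric, score).
ROWS = [
    ("Linguistics & Grammar", "BLiMP", "Accuracy", "68.12%"),
    ("Commonsense & Reasoning", "PIQA", "Normalized Accuracy", "57.83%"),
    ("Commonsense & Reasoning", "COPA", "Accuracy", "57.00%"),
    ("Commonsense & Reasoning", "BoolQ", "Accuracy", "52.17%"),
    ("Commonsense & Reasoning", "WinoGrande", "Accuracy", "47.36%"),
    ("Commonsense & Reasoning", "HellaSwag", "Normalized Accuracy", "28.49%"),
    ("Commonsense & Reasoning", "RACE", "Accuracy", "26.41%"),
    ("Commonsense & Reasoning", "CommonsenseQA", "Accuracy", "20.31%"),
    ("Academic & Knowledge", "SciQ", "Normalized Accuracy", "49.00%"),
    ("Academic & Knowledge", "ARC-Easy", "Normalized Accuracy", "36.49%"),
    ("Academic & Knowledge", "MMLU", "Accuracy", "25.64%"),
    ("Academic & Knowledge", "ARC-Challenge", "Normalized Accuracy", "25.17%"),
    ("Academic & Knowledge", "OpenBookQA", "Normalized Accuracy", "25.40%"),
    ("Language Modeling", "LAMBADA", "Accuracy", "15.87%"),
    ("Language Modeling", "WikiText-2", "Word Perplexity", "251.76"),
]

HEADERS = ("Category", "Benchmark", "Metric", "Score / Value")


def to_markdown(rows, headers):
    """Render rows as a Markdown table (hypothetical helper for illustration)."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Paste the printed table into the model card's README.md.
    print(to_markdown(ROWS, HEADERS))
```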

Kudos again on an excellent model release! Let me know if you would like any additional details regarding the evaluation environment setup.

Best regards,
Akshit

ThingsAI

ThingAI org about 23 hours ago

Hi Akshit,
Thank you so much for taking the time to run these comprehensive benchmarks on Quark-50m! I really appreciate the detailed breakdown and the feedback.
For a 50M parameter model, I'm quite happy to see such stable linguistic grounding on BLiMP and PIQA. It’s great to have these objective metrics to see exactly where it holds up and where the scaling limits hit.
I will definitely follow your recommendation and integrate this evaluation table into the Hugging Face model card so the community can have full transparency on its performance. I’ll also look into listing it on the AxiomicLabs leaderboard as you suggested!
Thanks again for the support and for contributing to the open-source tiny-LLM research.
Best regards,
Michelangelo Di Nicola (ThingAI Team)
