EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition Paper • 2505.20033 • Published May 26, 2025 • 4
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection Paper • 2506.09827 • Published Jun 11, 2025 • 24
KletterMix: Climbing Toward High-Quality German Pretraining Data Paper • 2606.03773 • Published 11 days ago • 18
GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction Paper • 2605.10108 • Published May 11 • 1
Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling Paper • 2604.28075 • Published Apr 30 • 20
ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering Paper • 2510.09351 • Published Oct 10, 2025
AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities Paper • 2508.04118 • Published Aug 6, 2025
FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale Paper • 2601.22146 • Published Jan 29 • 12
SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing Paper • 2512.11192 • Published Dec 12, 2025 • 1
SindBERT, the Sailor: Charting the Seas of Turkish NLP Paper • 2510.21364 • Published Oct 24, 2025 • 2
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15, 2025 • 9
UniFusion: Vision-Language Model as Unified Encoder in Image Generation Paper • 2510.12789 • Published Oct 14, 2025 • 19
Post 741 What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?
Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian Paper • 2509.05668 • Published Sep 6, 2025 • 8
Post 1180 By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:
- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. Wikipedia samples are removed because you should probably get those from a more reliable source that provides better-parsed content.
It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
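The two filters described in the post above can be pictured with a small sketch. This is not the author's pipeline: the `Sample` fields (`license_abbr`, `license_disagreement`, `in_fineweb`) and the helper names are hypothetical, chosen only for illustration, and the real dataset columns may differ. It also assumes that C5r is applied on top of C5f, which the word "additional" suggests but the post does not state outright.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Sample:
    # All field names are hypothetical stand-ins, not the real dataset schema.
    url: str
    license_abbr: str            # e.g. "cc-by" or "cc-by-nc-sa"
    license_disagreement: bool   # True if several detected licenses conflict
    in_fineweb: bool             # True if also present in FineWeb or FineWeb-2


def is_noncommercial(license_abbr: str) -> bool:
    """Treat any license whose abbreviation has an 'nc' part as non-commercial."""
    return "nc" in license_abbr.lower().split("-")


def is_wikipedia(url: str) -> bool:
    """Check whether the URL host is wikipedia.org or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def keep_fine(sample: Sample) -> bool:
    """C5f-style filter: keep only samples that also appear in FineWeb(-2)."""
    return sample.in_fineweb


def keep_recommended(sample: Sample) -> bool:
    """C5r-style filter (assumed to build on C5f): drop license conflicts,
    non-commercial licenses, and Wikipedia pages."""
    return (
        keep_fine(sample)
        and not sample.license_disagreement
        and not is_noncommercial(sample.license_abbr)
        and not is_wikipedia(sample.url)
    )


samples = [
    Sample("https://example.org/a", "cc-by", False, True),
    Sample("https://example.org/b", "cc-by-nc-sa", False, True),
    Sample("https://en.wikipedia.org/wiki/X", "cc-by-sa", False, True),
    Sample("https://example.org/c", "cc-by", True, True),
    Sample("https://example.org/d", "cc-by", False, False),
]

fine = [s.url for s in samples if keep_fine(s)]
recommended = [s.url for s in samples if keep_recommended(s)]
```

With these toy records, each stricter condition removes more samples, which mirrors the post's remark that the filters reduce the corpus size considerably.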
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering Paper • 2503.14996 • Published Mar 19, 2025 • 3
How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions Paper • 2506.16679 • Published Jun 20, 2025 • 2
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models Paper • 2505.22232 • Published May 28, 2025 • 18