Occiglot

community

https://occiglot.eu/

occiglot

Activity Feed

AI & ML interests

Open Source Language Models for Europe

Recent Activity

mkrausio

authored 3 papers 9 days ago

EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition

Paper • 2505.20033 • Published May 26, 2025 • 4

EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

Paper • 2506.09827 • Published Jun 11, 2025 • 24

QuAnTS: Question Answering on Time Series

Paper • 2511.05124 • Published Nov 7, 2025

mkrausio

authored a paper 10 days ago

KletterMix: Climbing Toward High-Quality German Pretraining Data

Paper • 2606.03773 • Published 11 days ago • 18

stefan-it

submitted 2 papers to Daily Papers about 1 month ago

GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction

Paper • 2605.10108 • Published May 11 • 1

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

Paper • 2604.28075 • Published Apr 30 • 20

s-conia

authored 2 papers about 2 months ago

ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering

Paper • 2510.09351 • Published Oct 10, 2025

AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities

Paper • 2508.04118 • Published Aug 6, 2025

stefan-it

submitted a paper to Daily Papers 4 months ago

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Paper • 2601.22146 • Published Jan 29 • 12

pjox

authored a paper 5 months ago

SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing

Paper • 2512.11192 • Published Dec 12, 2025 • 1

bjoernp

authored a paper 5 months ago

sui-1: Grounded and Verifiable Long-Form Summarization

Paper • 2601.08472 • Published Jan 13 • 3

stefan-it

authored 2 papers 8 months ago

SindBERT, the Sailor: Charting the Seas of Turkish NLP

Paper • 2510.21364 • Published Oct 24, 2025 • 2

The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Paper • 2510.13996 • Published Oct 15, 2025 • 9

mbrack

authored a paper 8 months ago

UniFusion: Vision-Language Model as Unified Encoder in Image Generation

Paper • 2510.12789 • Published Oct 14, 2025 • 19

BramVanroy

posted an update 8 months ago

Which multilingual models with at most 72B parameters are currently the best? Are Llama 3.3 70B and Qwen 2.5 72B still king?

1 reply

stefan-it

authored a paper 9 months ago

Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Paper • 2509.05668 • Published Sep 6, 2025 • 8

BramVanroy

posted an update 10 months ago

By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do the filtering manually:

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): retains only high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): applies additional strict filtering that removes samples with license disagreement, samples with non-commercial licenses, and Wikipedia samples. Wikipedia is removed because you should probably get it from a more reliable source that provides better-parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
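
As a rough illustration of the three C5r criteria described above (license disagreement, non-commercial licenses, Wikipedia samples), here is a minimal sketch that filters a few in-memory records. The field names (`url`, `license`, `license_disagreement`), the sample records, and the helper functions are hypothetical and are not taken from the actual dataset schema or filtering code; check the dataset pages for the real columns.

```python
from urllib.parse import urlparse

# Hypothetical records: these field names and values are illustrative
# assumptions, not the actual columns of the C5 datasets.
SAMPLES = [
    {"url": "https://example.org/a", "license": "cc-by", "license_disagreement": False},
    {"url": "https://example.org/b", "license": "cc-by-nc-sa", "license_disagreement": False},
    {"url": "https://en.wikipedia.org/wiki/X", "license": "cc-by-sa", "license_disagreement": False},
    {"url": "https://example.org/c", "license": "cc-by", "license_disagreement": True},
]


def is_non_commercial(license_abbr: str) -> bool:
    """True if a Creative Commons abbreviation contains the NC clause."""
    return "nc" in license_abbr.lower().split("-")


def is_wikipedia(url: str) -> bool:
    """True if the URL's host is wikipedia.org or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def keep_sample(sample: dict) -> bool:
    """Keep a sample only if none of the three exclusion criteria apply."""
    return not (
        sample["license_disagreement"]
        or is_non_commercial(sample["license"])
        or is_wikipedia(sample["url"])
    )


kept = [s for s in SAMPLES if keep_sample(s)]
print(f"kept {len(kept)} of {len(SAMPLES)} samples")
```

Even in this toy example only one of four records survives, which matches the point that strict filtering sharply reduces quantity.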

s-conia

authored a paper 12 months ago

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

Paper • 2503.14996 • Published Mar 19, 2025 • 3

mbrack

authored a paper 12 months ago

How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions

Paper • 2506.16679 • Published Jun 20, 2025 • 2

eliaswendt

authored a paper about 1 year ago

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Paper • 2505.22232 • Published May 28, 2025 • 18

Team members 15