Improving TrOCR Robustness under Unicode Diacritic Obfuscation
This model repository contains the fine-tuned checkpoint for a text recognition model optimized to defend automated hate speech detection pipelines against character-level diacritic attacks (Zalgo text obfuscation).
Model Details
- Base Architecture: Microsoft TrOCR (microsoft/trocr-base-printed)
- Task: Optical Character Recognition (OCR) / Vision Encoder-Decoder
- Fine-Tuning Objective: Recovering the clean text from words obfuscated with combining diacritical marks, so that downstream NLP classifiers receive unobfuscated input.
- Academic Context: Developed as part of the EE-559 Deep Learning course project (2026) at EPFL.
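Since the checkpoint is a TrOCR vision encoder-decoder, inference can probably follow the standard Hugging Face Transformers pattern. The sketch below is a minimal example under stated assumptions, not the authors' documented usage: the function name `recognize` is hypothetical, the repository id is the one shown on this model page, and loading the processor from the base model assumes the fine-tuned repository does not ship its own processor files.

```python
def recognize(image_path,
              model_id="Mamaa2001/trocr-model-diacritic",
              processor_id="microsoft/trocr-base-printed"):
    """Return the text decoded from one image of (possibly obfuscated) text.

    Assumption: the processor is loaded from the base model; switch
    processor_id to model_id if the fine-tuned repo provides its own.
    """
    # Local imports keep this module importable without the heavy dependencies.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(processor_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)

    # TrOCR expects RGB input; the processor handles resizing and normalization.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressive decoding with the model's default generation settings.
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `recognize("word.png")` on a cropped image should return a single decoded string; `word.png` is a placeholder path. For batches, loading the processor and model once outside the function would avoid repeated initialization.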
Training Data & Lineage
The model was fine-tuned using word-level diacritic injection mappings paired with:
- Dataset Source: google/civil_comments (released under CC0: Public Domain).
- Base Model Credits: Microsoft's UniLM Project (microsoft/trocr-base-printed), distributed under the MIT License.
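To make "word-level diacritic injection" concrete, here is a minimal sketch of how an (obfuscated, clean) training pair could be produced. It is an illustration only and not the project's actual mapping: the helper names, the choice of the Combining Diacritical Marks block (U+0300 to U+036F), and the fixed number of marks per character are all assumptions.

```python
import random
import unicodedata

# Assumption: marks are drawn from the Combining Diacritical Marks block.
COMBINING_MARKS = [chr(cp) for cp in range(0x0300, 0x0370)]


def inject_diacritics(word, marks_per_char=2, seed=None):
    """Append random combining marks after each alphabetic character."""
    rng = random.Random(seed)
    out = []
    for ch in word:
        out.append(ch)
        if ch.isalpha():
            out.extend(rng.choice(COMBINING_MARKS) for _ in range(marks_per_char))
    return "".join(out)


def strip_diacritics(text):
    """Sanity-check helper: drop all nonspacing marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


clean = "hello"
noisy = inject_diacritics(clean, marks_per_char=3, seed=0)
pair = (noisy, clean)  # (model input after rendering to an image, target text)
```

`strip_diacritics` is included only to check that the injection is reversible on ASCII words. It also removes legitimate accents (for example, "é" becomes "e"), so it is not a substitute for the recognition model.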
License & Citation
In line with the license of the base model, this fine-tuned checkpoint is distributed under the MIT License.
If you use this model or reproduce our evaluation pipeline, please credit Microsoft's TrOCR architecture and the Jigsaw/Google Civil Comments curation teams.