You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Improving TrOCR Robustness under Unicode Diacritic Obfuscation

This repository contains the fine-tuned checkpoint of a text recognition model trained to defend automated hate speech detection pipelines against character-level diacritic attacks, in which Unicode combining marks are stacked onto characters (Zalgo text obfuscation).

Model Details

Base Architecture: Microsoft TrOCR (microsoft/trocr-base-printed)
Task: Optical Character Recognition (OCR) / Vision Encoder-Decoder
Fine-Tuning Objective: Recover the clean underlying text from input obfuscated with combining diacritical marks, so that downstream NLP classifiers receive unobfuscated text.
Academic Context: Developed as part of the EE-559 Deep Learning course project (2026) at EPFL.
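
A minimal inference sketch, assuming the checkpoint loads with the standard Hugging Face `transformers` TrOCR classes. The repository ID is the one this page belongs to. Loading the processor from the base model is an assumption, since this card does not say whether processor files ship with the fine-tuned checkpoint. The `recognize` helper and its `image_path` argument are hypothetical names. Because the repository is gated, the download only works after accepting the access conditions and authenticating (for example with `huggingface-cli login`).

```python
MODEL_ID = "Mamaa2001/trocr-model-diacritic"   # fine-tuned checkpoint (this repository)
PROCESSOR_ID = "microsoft/trocr-base-printed"  # assumption: reuse the base model's preprocessing


def recognize(image_path: str) -> str:
    """Run OCR on one image of (possibly obfuscated) text and return the decoded string."""
    # Imported inside the function so the heavy dependencies are only needed at call time.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(PROCESSOR_ID)
    model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)

    # TrOCR expects RGB input; the processor resizes and normalizes the image.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressively decode token IDs, then map them back to text.
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `recognize("word.png")` on an image of a single rendered word would return the model's transcription. How the project renders text to images (font, size, cropping) is not stated here, so any rendering step is left to the reader.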

Training Data & Lineage

The model was fine-tuned on text obfuscated with word-level diacritic injection mappings. The source text and the starting checkpoint are:

Dataset Source: google/civil_comments (released under CC0: Public Domain).
Base Model Credits: Microsoft's UniLM Project (microsoft/trocr-base-printed), distributed under the MIT License.
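
Word-level diacritic injection can be illustrated with a short sketch. This is not the project's actual mapping code, which this card does not include. It assumes that marks are drawn uniformly from the Unicode Combining Diacritical Marks block (U+0300 to U+036F) and that each word is obfuscated with a fixed probability. The names `inject_diacritics`, `make_pair`, `strip_marks`, `max_marks`, and `word_prob` are hypothetical.

```python
import random
import unicodedata

# Combining Diacritical Marks block, U+0300..U+036F (assumed mark inventory).
COMBINING_MARKS = [chr(cp) for cp in range(0x0300, 0x0370)]


def inject_diacritics(word: str, rng: random.Random, max_marks: int = 3) -> str:
    """Append 1..max_marks random combining marks after every letter of a word."""
    out = []
    for ch in word:
        out.append(ch)
        if ch.isalpha():
            count = rng.randint(1, max_marks)
            out.extend(rng.choice(COMBINING_MARKS) for _ in range(count))
    return "".join(out)


def make_pair(sentence: str, rng: random.Random, word_prob: float = 0.5):
    """Return (obfuscated, clean): each word is obfuscated with probability word_prob."""
    words = sentence.split(" ")
    noisy = [inject_diacritics(w, rng) if rng.random() < word_prob else w for w in words]
    return " ".join(noisy), sentence


def strip_marks(text: str) -> str:
    """Sanity-check helper: drop nonspacing marks (category Mn) to recover the base text."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


rng = random.Random(0)  # fixed seed so the pairs are reproducible
noisy, clean = make_pair("this is a comment", rng, word_prob=1.0)
assert strip_marks(noisy) == clean
```

In a pipeline like the one described, the obfuscated string would be the input side of each training example and the clean string the target. `strip_marks` is included only to check that the injection leaves the base characters intact.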

License & Citation

Consistent with the license of the base model, this fine-tuned checkpoint is distributed under the MIT License.

If you use this model or reproduce our evaluation pipeline, please credit Microsoft's TrOCR architecture and the Jigsaw/Google Civil Comments curation teams.

Checkpoint Format: Safetensors
Model Size: 0.3B params
Tensor Type: F32

Repository: Mamaa2001/trocr-model-diacritic