Domain-Adapted XLM-RoBERTa-Large for Old Swedish (SDHK)

This model is a domain-adapted version of XLM-RoBERTa-Large, further pre-trained with a masked language modelling (MLM) objective on the full Old Swedish corpus of the Svenskt Diplomatariums Huvudkartotek (SDHK), the main catalogue of the Swedish Diplomatarium maintained by the Swedish National Archives (Riksarkivet). It is intended primarily as a foundation for downstream fine-tuning on Old Swedish text, and it is the base model of phenningsson/sdhk-ner-old-swedish-v2, a named entity recognition (NER) model for Old Swedish charters.

Model Description

  • Model type: Masked language model (encoder, XLM-RoBERTa architecture)
  • Base model: FacebookAI/xlm-roberta-large
  • Language: Old Swedish (medieval Swedish charter language)
  • Adaptation method: Continued pre-training (Task-Adaptive Pre-Training, TAPT) on Old Swedish edition texts

The model retains the XLM-RoBERTa-Large tokenizer and architecture, but its encoder weights have been adapted to the orthographic, morphological, and lexical patterns of medieval Swedish charter language, which are poorly represented in the multilingual XLM-RoBERTa pre-training data. For more information about this work, see phenningsson/sdhk-ner-old-swedish-v2 and the companion repository.
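Because the tokenizer and architecture are unchanged, the model can be queried like any other XLM-RoBERTa masked language model. The following is a minimal sketch, assuming the Hugging Face `transformers` library is installed and that the weights are published as `phenningsson/sdhk-mlm-pretrained-full`, the repository name shown on the model page. The helper names `mask_word` and `predict_masked` are hypothetical, and the example phrase is only an illustrative charter-style opening; spelling in real charters varies.

```python
MODEL_ID = "phenningsson/sdhk-mlm-pretrained-full"  # repository name as shown on the model page
MASK_TOKEN = "<mask>"  # mask token used by the XLM-RoBERTa tokenizer


def mask_word(sentence: str, index: int) -> str:
    """Replace the whitespace-separated word at `index` with the mask token."""
    words = sentence.split()
    words[index] = MASK_TOKEN
    return " ".join(words)


def predict_masked(sentence: str, top_k: int = 5):
    """Return (token, score) pairs for the single masked position in `sentence`.

    The import is done lazily so that this module can be loaded without
    `transformers`; the first call downloads the model weights.
    """
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    return [(p["token_str"], p["score"]) for p in fill_mask(sentence, top_k=top_k)]


# Illustrative usage (requires network access and enough memory for the model):
#   masked = mask_word("Allom them thetta breff høra ællir see", 3)
#   for token, score in predict_masked(masked):
#       print(token, round(score, 3))
```

For downstream fine-tuning, the same repository name would typically be passed to a task-specific loader such as `AutoModelForTokenClassification.from_pretrained`, which reuses the adapted encoder and adds a freshly initialised classification head.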

MLM hyperparameters:

  • Base model: FacebookAI/xlm-roberta-large
  • Epochs: 8
  • Effective batch size: 32 (batch 2 × gradient accumulation 16)
  • Learning rate: 3e-5
  • Max sequence length: 256
  • Warm-up ratio: 6%
  • LR schedule: cosine
  • MLM masking probability: 15%
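The arithmetic behind these settings can be sketched as follows. This is an illustration only, not the author's training script: the corpus size of 10,000 sequences is a made-up figure, training on a single device is assumed, and the learning-rate curve assumes linear warm-up followed by cosine decay to zero, which is a common convention but is not stated above. Training frameworks may also round step counts slightly differently.

```python
import math

# Values taken from the hyperparameter list above.
MLM_CONFIG = {
    "epochs": 8,
    "per_device_batch_size": 2,
    "gradient_accumulation_steps": 16,
    "learning_rate": 3e-5,
    "max_seq_length": 256,
    "warmup_ratio": 0.06,
    "mlm_probability": 0.15,
}


def effective_batch_size(cfg=MLM_CONFIG) -> int:
    """Sequences contributing to one optimizer step (single device assumed)."""
    return cfg["per_device_batch_size"] * cfg["gradient_accumulation_steps"]


def schedule_summary(num_sequences: int, cfg=MLM_CONFIG) -> dict:
    """Approximate optimizer-step counts for a corpus of `num_sequences`."""
    steps_per_epoch = math.ceil(num_sequences / effective_batch_size(cfg))
    total_steps = steps_per_epoch * cfg["epochs"]
    warmup_steps = math.ceil(total_steps * cfg["warmup_ratio"])
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float = MLM_CONFIG["learning_rate"]) -> float:
    """Learning rate under linear warm-up then cosine decay (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical corpus of 10,000 sequences:
summary = schedule_summary(10_000)
# summary == {"steps_per_epoch": 313, "total_steps": 2504, "warmup_steps": 151}
```

With these settings each optimizer step sees 2 × 16 = 32 sequences, and at a masking probability of 15% and a maximum length of 256 tokens, at most about 38 positions per sequence are selected for prediction.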

License

This model is released under the GNU General Public License v3.0 (GPL-3.0).

Contact

For questions or issues, please open an issue on the GitHub repository or contact: phenningsson@me.com
