arXiv:2606.18717

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Published on Jun 17 · Submitted by Tolga Şakar on Jun 18
Abstract

A neural morpheme-boundary model for Turkish achieves lossless tokenization and morphology-aware embeddings with improved efficiency and performance over traditional subword methods.

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and, in the case of WordPiece and rule-based analyzers, failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers (the only ones valid for generation), Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
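
The boundary-to-segment idea in the abstract can be illustrated with a small sketch. This is one plausible reading of the abstract, not the authors' implementation. If each gap between characters carries an independent boundary probability, the number of boundaries before a character follows a Poisson-binomial distribution, and the probability that a character belongs to morpheme k equals the probability that exactly k boundaries precede it. The function names (`soft_memberships`, `encode`, `decode`), the 0.5 threshold, the overflow handling, and the probabilities for the example word are all assumptions made for illustration.

```python
def soft_memberships(boundary_probs, max_segments):
    """Return M where M[i][k] is the probability that character i lies in
    segment k, i.e. that exactly k boundaries precede it.

    boundary_probs[j] is the (assumed independent) probability of a boundary
    between character j and character j + 1.
    """
    dist = [1.0] + [0.0] * (max_segments - 1)  # character 0 is in segment 0
    rows = [dist[:]]
    for p in boundary_probs:
        new = [0.0] * max_segments
        for k in range(max_segments):
            new[k] += dist[k] * (1.0 - p)      # no boundary: same segment
            if k + 1 < max_segments:
                new[k + 1] += dist[k] * p      # boundary: next segment
            else:
                new[k] += dist[k] * p          # assumption: overflow stays in last segment
        dist = new
        rows.append(dist[:])
    return rows


def encode(word, boundary_probs, threshold=0.5):
    """Cut the raw string at gaps whose probability exceeds the threshold.
    No normalization is applied, so the segments are exact substrings."""
    segments, start = [], 0
    for j, p in enumerate(boundary_probs):
        if p > threshold:
            segments.append(word[start:j + 1])
            start = j + 1
    segments.append(word[start:])
    return segments


def decode(segments):
    """Concatenation inverts encode because no characters were altered."""
    return "".join(segments)


# Hypothetical boundary probabilities for "evlerde" (ev + ler + de).
word = "evlerde"
probs = [0.05, 0.9, 0.05, 0.05, 0.9, 0.05]

memberships = soft_memberships(probs, max_segments=3)
segments = encode(word, probs)
print(segments)
print(decode(segments) == word)
```

Because `encode` only slices the input string and `decode` only concatenates, the round trip holds for any input and any boundary probabilities, which matches the "by construction" claim in the abstract. The soft-membership recurrence uses only sums and products, so it stays differentiable when written in an autodiff framework.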
