BEREL-seg: TBD

State-of-the-art language model for Rabbinic Hebrew, released [here] - add link.

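This model is fine-tuned from BEREL_3.0 for the prefix segmentation task: splitting each word into its attached prefix letters (such as ו, ה, ב, ל, ש) and the remaining word.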

Sample usage:

from transformers import AutoModel, AutoTokenizer

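# Load the tokenizer and the model. trust_remote_code=True lets transformers
# load the custom model code from the repository, which provides predict().
tokenizer = AutoTokenizer.from_pretrained('dicta-il/BEREL-seg')
model = AutoModel.from_pretrained('dicta-il/BEREL-seg', trust_remote_code=True)

# Switch to inference mode (disables dropout).
model.eval()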

sentence = 'וזה לשון הרמב״ן בפירושו על התורה, שהדבר ידוע ומפורסם לכל בעלי העיון שאין המקרא יוצא מידי פשוטו אף על פי שהדרש אמת.'

# predict() takes a list of sentences and returns one list of segmented words per sentence.
print(model.predict([sentence], tokenizer))

Output:

[
  [
    [ "[CLS]" ],
    [ "ו", "זה" ],
    [ "לשון" ],
    [ "ה", "רמב\"ן" ],
    [ "ב", "פירושו" ],
    [ "על" ],
    [ "ה", "תורה" ],
    [ ", " ],
    [ "שהד", "בר" ],
    [ "ידוע" ],
    [ "ו", "מפורסם" ],
    [ "ל", "כל" ],
    [ "בעלי" ],
    [ "ה", "עיון" ],
    [ "ש", "אין" ],
    [ "ה", "מקרא" ],
    [ "יוצא" ],
    [ "פשוטו" ],
    [ "אף" ],
    [ "על" ],
    [ "פי" ],
    [ "שהד", "רש" ],
    [ "אמת" ],
    [ "." ],
    [ "[SEP]" ]
  ]
]
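
Post-processing the output:

The nested structure above (one list per sentence, one list per word, one string per segment) can be flattened into a readable form. The following is a minimal sketch, not part of the model's API: the helper name join_segments, the "|" separator, and the choice to drop the [CLS] and [SEP] entries are all assumptions made for illustration. It runs on a hand-copied fragment of the output shown above, so it does not require loading the model.

```python
from typing import List

# Special tokens that appear as their own entries in the output above.
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def join_segments(prediction: List[List[str]], sep: str = "|") -> List[str]:
    """Join the segments of each word with `sep`, skipping special tokens.

    `prediction` is the entry for ONE sentence, i.e. model.predict(...)[0]
    in the sample usage. This helper is hypothetical, not a library function.
    """
    words = []
    for segments in prediction:
        # Drop entries that consist only of a special token.
        if len(segments) == 1 and segments[0] in SPECIAL_TOKENS:
            continue
        words.append(sep.join(segments))
    return words


# A fragment of the sample output, copied by hand.
sample = [
    ["[CLS]"],
    ["ו", "זה"],
    ["לשון"],
    ["ה", "רמב\"ן"],
    ["[SEP]"],
]

marked = join_segments(sample)
print(" ".join(marked))
```

With a real prediction, the same helper would be called once per sentence, for example join_segments(model.predict([sentence], tokenizer)[0]).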