donut-base-ascii

This is "naver-clova-ix/donut-base" with all non-ASCII tokens removed. It is therefore suited to basic English use cases where the text is primarily a-z, A-Z, 0-9, and basic punctuation.

The original model, "naver-clova-ix/donut-base", did not have a token for "1", so that has also been added. The notebook remove-donut-tokens.ipynb details the whole process.
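The notebook is the authoritative description of the process. As a rough illustration of the idea only, the sketch below filters a toy vocabulary down to ASCII tokens and then appends the missing "1" token. The names `is_ascii_token`, `filter_vocab`, and `toy_vocab` are hypothetical and are not taken from the notebook. It also assumes a SentencePiece-style vocabulary in which the word-boundary marker "▁" (U+2581) is ignored when checking for ASCII.

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker, itself non-ASCII


def is_ascii_token(token):
    """Return True if the token is ASCII once the word-boundary marker is ignored."""
    return token.replace(SPIECE_UNDERLINE, "").isascii()


def filter_vocab(vocab, keep):
    """Keep ASCII tokens plus anything in `keep`, then renumber ids contiguously."""
    ordered = sorted(vocab.items(), key=lambda kv: kv[1])  # preserve original id order
    kept = [tok for tok, _ in ordered if tok in keep or is_ascii_token(tok)]
    return {tok: new_id for new_id, tok in enumerate(kept)}


# Toy vocabulary (hypothetical): two special tokens, one English piece,
# two non-ASCII pieces, and the digit "2". Note that "1" is absent.
toy_vocab = {
    "<s>": 0,
    "</s>": 1,
    "\u2581the": 2,
    "\u00e9t\u00e9": 3,
    "\u2581\u6771\u4eac": 4,
    "2": 5,
}

new_vocab = filter_vocab(toy_vocab, keep={"<s>", "</s>"})

# Append the missing digit token at the end of the vocabulary.
if "1" not in new_vocab:
    new_vocab["1"] = len(new_vocab)

print(new_vocab)
```

For a real model, the embedding and output rows would presumably also need to be pruned to match the new ids. That step is not shown here.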

This model has not been trained any further than the original model.

I made a whole video about it: https://youtu.be/Uzr553x1gdM

I ran a quick generation speed test comparing this model against the default model, with and without `bad_words_ids`. The `bad_words_ids` list contained only 12k tokens instead of the 30k that were removed, and generation was still noticeably slower.
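For context, `bad_words_ids` is the `generate` argument in Hugging Face Transformers that takes a list of token-id sequences to block during decoding. The sketch below shows one way such a list could be built for single non-ASCII tokens. It is a minimal illustration under assumptions, not the linked speed script: `build_bad_words_ids` and `toy_vocab` are hypothetical names, and the vocabulary is a toy stand-in for the real tokenizer.

```python
def build_bad_words_ids(vocab, marker="\u2581"):
    """Return [[token_id], ...] for every token that is not ASCII.

    The SentencePiece word-boundary marker is ignored for the ASCII check.
    Each banned token is wrapped in its own list because `bad_words_ids`
    expects a list of token-id sequences.
    """
    bad = []
    for token, token_id in sorted(vocab.items(), key=lambda kv: kv[1]):
        if not token.replace(marker, "").isascii():
            bad.append([token_id])
    return bad


# Toy vocabulary (hypothetical); ids 3 and 4 are the non-ASCII tokens.
toy_vocab = {
    "<s>": 0,
    "</s>": 1,
    "\u2581the": 2,
    "\u00e9t\u00e9": 3,
    "\u2581\u6771\u4eac": 4,
    "2": 5,
}

bad_words_ids = build_bad_words_ids(toy_vocab)
print(bad_words_ids)

# With a real model this list would be passed at generation time, e.g.
#   model.generate(..., bad_words_ids=bad_words_ids)
```

One plausible reason for the slowdown is that the banned ids have to be masked out of the logits at every decoding step, whereas a model with the tokens removed never computes them at all.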

Speed script here
Launched with this

| Approach | Time to generate 10 tokens |
| --- | --- |
| "naver-clova-ix/donut-base" | 205 ms |
| "naver-clova-ix/donut-base" + 12k `bad_words_ids` | 280 ms |
| "donut-base-ascii" | 195 ms |
