Instructions to use urduhack/roberta-urdu-small with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use urduhack/roberta-urdu-small with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="urduhack/roberta-urdu-small")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("urduhack/roberta-urdu-small")
model = AutoModelForMaskedLM.from_pretrained("urduhack/roberta-urdu-small")
```
- Notebooks
- Google Colab
- Kaggle
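As a minimal sketch of querying the pipeline, the snippet below fills a masked Urdu sentence. The example sentence and `top_k` value are illustrative assumptions, not from the model card; the mask token is read from the pipeline's tokenizer rather than hard-coded:

```python
from transformers import pipeline

# Load the fill-mask pipeline for the Urdu RoBERTa model
pipe = pipeline("fill-mask", model="urduhack/roberta-urdu-small")

# Use the tokenizer's own mask token; the Urdu sentence
# ("this is a <mask>.") is only an illustrative placeholder.
masked = f"یہ ایک {pipe.tokenizer.mask_token} ہے۔"
results = pipe(masked, top_k=3)

# Each result carries a candidate token and its score
for r in results:
    print(r["token_str"], round(r["score"], 4))
```

Each entry in `results` is a dict with the predicted `token_str`, its `score`, and the fully filled `sequence`.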
Guidelines for training an Urdu tokenizer on a much larger corpus
#1
by hadidev - opened
Hi, I'm trying to train a tokenizer as well as a BERT model on Urdu data. Can you share the step-by-step process you used for this model?
Installation
Urduhack officially supports Python 3.6–3.7, and runs great on PyPy.
To install with the TensorFlow CPU version:
$ pip install urduhack[tf]
To install with the TensorFlow GPU version:
$ pip install urduhack[tf-gpu]
Usage
```python
import urduhack

# Downloading models
urduhack.download()

nlp = urduhack.Pipeline()
text = ""
doc = nlp(text)
for sentence in doc.sentences:
    print(sentence.text)
    for word in sentence.words:
        print(f"{word.text}\t{word.pos}")
    for token in sentence.tokens:
        print(f"{token.text}\t{token.ner}")
```