# Indic-Bert-NER-Model
A fine-tuned Named Entity Recognition (NER) model based on ai4bharat/indic-bert for extracting medical and regulatory entities from Indian language documents.
## Model Details

### Overview
This model is fine-tuned for NER tasks on medical and regulatory documents, specifically for identifying entities in adverse event reports and regulatory submissions. It extends the multilingual Indic-BERT base model with specialized training on pharmaceutical and medical regulatory terminology.
### Model Architecture
- Base Model: ai4bharat/indic-bert
- Task: Token Classification (Named Entity Recognition)
- Languages Supported: Indian languages (Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, and others)
- Framework: PyTorch / Transformers
### Model Specifications
- Model Type: BERT for token classification
- Tokenizer: SentencePiece
- Max Sequence Length: 512 tokens
- Hidden Size: 768
- Number of Attention Heads: 12
- Number of Layers: 12
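Because the maximum sequence length is 512 tokens, documents longer than that must be chunked before inference. Below is a minimal sketch of an overlapping-window splitter; the window and stride values are illustrative choices, not part of the model configuration.

```python
def chunk_tokens(tokens, window=512, stride=128):
    """Split a token list into overlapping windows so no text is lost at
    chunk boundaries; entities near an edge appear in two windows."""
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += window - stride
    return chunks

# Toy example with a small window for readability
print(chunk_tokens(list(range(10)), window=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

After running the model on each window, predictions in the overlap region can be deduplicated by keeping the higher-confidence span.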
## Training Data
The model was trained on the Indic-Bert-NER-BIO-Dataset, which includes:
- Annotated medical and pharmaceutical regulatory documents
- Multiple data sources: CTRI, FAERS, JSL datasets
- Phase 2 augmented and merged datasets for improved robustness
- BIO (Begin-Inside-Outside) tagged entities
For detailed dataset information, see: Indic-Bert-NER-BIO-Dataset
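In the BIO scheme, each token carries a `B-` (begin), `I-` (inside), or `O` (outside) tag. A small sketch of how such tag sequences collapse into entity spans; the tag names `B-DRUG` and `B-CONDITION` here are illustrative, not necessarily the dataset's actual label set.

```python
def bio_to_spans(tokens, tags):
    """Collapse BIO tags into (entity_type, text) spans.
    A B- tag opens a span; I- tags of the same type extend it."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["Paracetamol", "treats", "severe", "headache", "."]
tags = ["B-DRUG", "O", "B-CONDITION", "I-CONDITION", "O"]
print(bio_to_spans(tokens, tags))
# [('DRUG', 'Paracetamol'), ('CONDITION', 'severe headache')]
```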
## Supported Entity Tags
The model recognizes the following entity categories:
- Medical Entities: Drug names, diseases, medical conditions
- Regulatory Entities: Dosages, routes of administration, adverse events
- Document Entities: Document types, regulatory references
Complete entity taxonomy available in the dataset repository.
## Usage

### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    pipeline,
)

# Load model and tokenizer
model_name = "sharkdodo/Indic-Bert-NER-Model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Example text
text = "This drug is made from paracetamol and is used for headache treatment."

# Perform NER
results = ner_pipeline(text)
print(results)
```
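With `aggregation_strategy="simple"`, the pipeline returns one dict per entity (`entity_group`, `score`, `word`, `start`, `end`). In practice it is often useful to drop low-confidence spans before downstream use. The results below are mock data in that standard output shape; the entity group names and scores are made up for illustration.

```python
def filter_entities(results, min_score=0.80):
    """Keep only entities the model is reasonably confident about."""
    return [r for r in results if r["score"] >= min_score]

# Mock output mirroring the aggregated pipeline format
results = [
    {"entity_group": "DRUG", "score": 0.97, "word": "paracetamol", "start": 23, "end": 34},
    {"entity_group": "CONDITION", "score": 0.54, "word": "headache", "start": 51, "end": 59},
]
print(filter_entities(results))  # only the high-confidence DRUG span remains
```

The 0.80 threshold is an arbitrary starting point; tune it against labeled validation data for your use case.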
### Advanced Usage with Custom Labels

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "sharkdodo/Indic-Bert-NER-Model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "The drug dosage is 500 milligrams daily."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get predictions (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Map predictions to labels
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, pred in zip(tokens, predictions[0].tolist()):
    print(f"{token}: {id2label[pred]}")
```
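The loop above prints one label per SentencePiece piece, so a single word can appear as several pieces. A common post-processing step is to re-assemble pieces into words, keeping the label of each word's first piece. A minimal sketch, assuming SentencePiece's `▁` word-start marker; the example pieces and label names are illustrative, and the special-token names to skip depend on the tokenizer.

```python
def merge_sentencepiece(tokens, labels):
    """Re-assemble SentencePiece pieces into words, keeping the label of
    the first piece of each word. "▁" marks a word start."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]", "<s>", "</s>"):
            continue  # skip special tokens
        if token.startswith("▁") or not words:
            words.append(token.lstrip("▁"))
            word_labels.append(label)
        else:
            words[-1] += token  # continuation piece: extend current word
    return list(zip(words, word_labels))

pieces = ["▁para", "cetamol", "▁treats", "▁head", "ache"]
labels = ["B-DRUG", "I-DRUG", "O", "B-CONDITION", "I-CONDITION"]
print(merge_sentencepiece(pieces, labels))
# [('paracetamol', 'B-DRUG'), ('treats', 'O'), ('headache', 'B-CONDITION')]
```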
### Batch Processing

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="sharkdodo/Indic-Bert-NER-Model",
    aggregation_strategy="simple",
)

texts = [
    "Paracetamol is commonly used to treat headaches and fever.",
    "Take Ibuprofen 400 milligrams tablet for pain relief.",
]

results = [ner(text) for text in texts]
for text, entities in zip(texts, results):
    print(f"Text: {text}")
    print(f"Entities: {entities}\n")
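For monitoring workloads, batch results are often aggregated into per-entity frequencies (e.g. how often each drug is mentioned across reports). A minimal sketch on mock data in the pipeline's aggregated output format; the entity groups and scores are invented for illustration.

```python
from collections import Counter

def entity_counts(batch_results):
    """Count occurrences of each (entity_group, lowercased surface form)
    across a batch of pipeline outputs."""
    counter = Counter()
    for entities in batch_results:
        for ent in entities:
            counter[(ent["entity_group"], ent["word"].lower())] += 1
    return counter

# Mock outputs for two documents
batch = [
    [{"entity_group": "DRUG", "word": "Paracetamol", "score": 0.98}],
    [{"entity_group": "DRUG", "word": "paracetamol", "score": 0.95},
     {"entity_group": "DRUG", "word": "Ibuprofen", "score": 0.97}],
]
counts = entity_counts(batch)
print(counts[("DRUG", "paracetamol")])  # 2 — casing is normalized
```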
## Model Card

### Model Use

**Intended Use:** Named Entity Recognition for medical and regulatory documents in Indian languages.

**Primary Users:**
- Healthcare professionals
- Regulatory compliance teams
- Medical document processors
- Adverse event monitoring systems
### Limitations

- The model was trained primarily on Hindi and English-transliterated Indian-language text
- Performance may vary across regional language variations
- Performs best on well-formatted documents
- Trained on a specific pharmaceutical and regulatory domain; transfer to other domains may be limited
### Ethical Considerations
- Use only for legitimate regulatory and medical purposes
- Ensure data privacy compliance when processing sensitive health information
- Do not use for automated decision-making in clinical settings without human review
- Respect patient confidentiality and HIPAA/DPDP compliance
## License

This model is released under the MIT License.

```
MIT License

Copyright (c) 2026 Vivek Molleti

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{indic_bert_ner_2026,
  title  = {Indic-Bert-NER-Model},
  author = {Vivek Molleti},
  year   = {2026},
  url    = {https://huggingface.co/sharkdodo/Indic-Bert-NER-Model},
  note   = {Fine-tuned from AI4Bharat's Indic-BERT}
}
```
## Related Resources
- Base Model: AI4Bharat Indic-BERT
- Dataset: Indic-Bert-NER-BIO-Dataset
## Changelog

### Version 1.0 (April 2026)
- Initial release
- Fine-tuned on Phase 2 augmented dataset
- Support for Indian languages via Indic-BERT base