# Indic-Bert-NER-Model
A fine-tuned Named Entity Recognition (NER) model based on ai4bharat/indic-bert for extracting medical and regulatory entities from Indian language documents.
## Model Details

### Overview
This model is fine-tuned for NER tasks on medical and regulatory documents, specifically for identifying entities in adverse event reports and regulatory submissions. It extends the multilingual Indic-BERT base model with specialized training on pharmaceutical and medical regulatory terminology.
### Model Architecture
- Base Model: ai4bharat/indic-bert
- Task: Token Classification (Named Entity Recognition)
- Languages Supported: Indian languages (Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, and others)
- Framework: PyTorch / Transformers
### Model Specifications
- Model Type: BERT for token classification
- Tokenizer: SentencePiece
- Max Sequence Length: 512 tokens
- Hidden Size: 768
- Number of Attention Heads: 12
- Number of Layers: 12
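Because the maximum sequence length is 512 tokens, documents longer than that must be chunked before inference. Below is a minimal sketch of an overlapping-window splitter; the window and stride values are illustrative choices, not part of the model configuration.

```python
def chunk_tokens(tokens, window=512, stride=128):
    """Split a token list into overlapping windows so no text is lost at
    chunk boundaries; entities near an edge appear in two windows."""
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += window - stride
    return chunks

# Toy example with a small window for readability
print(chunk_tokens(list(range(10)), window=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

After running the model on each window, predictions in the overlap region can be deduplicated by keeping the higher-confidence span.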
## Training Data
The model was trained on the Indic-Bert-NER-BIO-Dataset, which includes:
- Annotated medical and pharmaceutical regulatory documents
- Multiple data sources: CTRI, FAERS, JSL datasets
- Phase 2 augmented and merged datasets for improved robustness
- BIO (Begin-Inside-Outside) tagged entities
For detailed dataset information, see: Indic-Bert-NER-BIO-Dataset
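In the BIO scheme, each token carries a `B-` (begin), `I-` (inside), or `O` (outside) tag. A small sketch of how such tag sequences collapse into entity spans; the tag names `B-DRUG` and `B-CONDITION` here are illustrative, not necessarily the dataset's actual label set.

```python
def bio_to_spans(tokens, tags):
    """Collapse BIO tags into (entity_type, text) spans.
    A B- tag opens a span; I- tags of the same type extend it."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["Paracetamol", "treats", "severe", "headache", "."]
tags = ["B-DRUG", "O", "B-CONDITION", "I-CONDITION", "O"]
print(bio_to_spans(tokens, tags))
# [('DRUG', 'Paracetamol'), ('CONDITION', 'severe headache')]
```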
## Supported Entity Tags
The model recognizes the following entity categories:
- Medical Entities: Drug names, diseases, medical conditions
- Regulatory Entities: Dosages, routes of administration, adverse events
- Document Entities: Document types, regulatory references
Complete entity taxonomy available in the dataset repository.
## Usage

### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    pipeline,
)

# Load model and tokenizer
model_name = "sharkdodo/Indic-Bert-NER-Model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Example text
text = "This drug is made from paracetamol and is used for headache treatment."

# Perform NER
results = ner_pipeline(text)
print(results)
```
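With `aggregation_strategy="simple"`, the pipeline returns one dict per entity (`entity_group`, `score`, `word`, `start`, `end`). In practice it is often useful to drop low-confidence spans before downstream use. The results below are mock data in that standard output shape; the entity group names and scores are made up for illustration.

```python
def filter_entities(results, min_score=0.80):
    """Keep only entities the model is reasonably confident about."""
    return [r for r in results if r["score"] >= min_score]

# Mock output mirroring the aggregated pipeline format
results = [
    {"entity_group": "DRUG", "score": 0.97, "word": "paracetamol", "start": 23, "end": 34},
    {"entity_group": "CONDITION", "score": 0.54, "word": "headache", "start": 51, "end": 59},
]
print(filter_entities(results))  # only the high-confidence DRUG span remains
```

The 0.80 threshold is an arbitrary starting point; tune it against labeled validation data for your use case.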
### Advanced Usage with Custom Labels

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "sharkdodo/Indic-Bert-NER-Model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "The drug dosage is 500 milligrams daily."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get predictions (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Map predictions to labels
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, pred in zip(tokens, predictions[0].tolist()):
    print(f"{token}: {id2label[pred]}")
```
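The loop above prints one label per SentencePiece piece, so a single word can appear as several pieces. A common post-processing step is to re-assemble pieces into words, keeping the label of each word's first piece. A minimal sketch, assuming SentencePiece's `▁` word-start marker; the example pieces and label names are illustrative, and the special-token names to skip depend on the tokenizer.

```python
def merge_sentencepiece(tokens, labels):
    """Re-assemble SentencePiece pieces into words, keeping the label of
    the first piece of each word. "▁" marks a word start."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]", "<s>", "</s>"):
            continue  # skip special tokens
        if token.startswith("▁") or not words:
            words.append(token.lstrip("▁"))
            word_labels.append(label)
        else:
            words[-1] += token  # continuation piece: extend current word
    return list(zip(words, word_labels))

pieces = ["▁para", "cetamol", "▁treats", "▁head", "ache"]
labels = ["B-DRUG", "I-DRUG", "O", "B-CONDITION", "I-CONDITION"]
print(merge_sentencepiece(pieces, labels))
# [('paracetamol', 'B-DRUG'), ('treats', 'O'), ('headache', 'B-CONDITION')]
```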
### Batch Processing

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="sharkdodo/Indic-Bert-NER-Model",
    aggregation_strategy="simple",
)

texts = [
    "Paracetamol is commonly used to treat headaches and fever.",
    "Take Ibuprofen 400 milligrams tablet for pain relief.",
]

results = [ner(text) for text in texts]
for text, entities in zip(texts, results):
    print(f"Text: {text}")
    print(f"Entities: {entities}\n")
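For monitoring workloads, batch results are often aggregated into per-entity frequencies (e.g. how often each drug is mentioned across reports). A minimal sketch on mock data in the pipeline's aggregated output format; the entity groups and scores are invented for illustration.

```python
from collections import Counter

def entity_counts(batch_results):
    """Count occurrences of each (entity_group, lowercased surface form)
    across a batch of pipeline outputs."""
    counter = Counter()
    for entities in batch_results:
        for ent in entities:
            counter[(ent["entity_group"], ent["word"].lower())] += 1
    return counter

# Mock outputs for two documents
batch = [
    [{"entity_group": "DRUG", "word": "Paracetamol", "score": 0.98}],
    [{"entity_group": "DRUG", "word": "paracetamol", "score": 0.95},
     {"entity_group": "DRUG", "word": "Ibuprofen", "score": 0.97}],
]
counts = entity_counts(batch)
print(counts[("DRUG", "paracetamol")])  # 2 — casing is normalized
```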
## Model Card

### Model Use

**Intended Use:** Named Entity Recognition for medical and regulatory documents in Indian languages.

**Primary Users:**
- Healthcare professionals
- Regulatory compliance teams
- Medical document processors
- Adverse event monitoring systems
### Limitations

- The model was trained primarily on Hindi and English-transliterated Indian-language text
- Performance may vary across regional language variations
- Performs best on well-formatted documents
- Trained on a specific pharmaceutical and regulatory domain; transfer to other domains may be limited
### Ethical Considerations
- Use only for legitimate regulatory and medical purposes
- Ensure data privacy compliance when processing sensitive health information
- Do not use for automated decision-making in clinical settings without human review
- Respect patient confidentiality and HIPAA/DPDP compliance
## License

This model is released under the MIT License.

```
MIT License

Copyright (c) 2026 Vivek Molleti

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```
## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{indic_bert_ner_2026,
  title  = {Indic-Bert-NER-Model},
  author = {Vivek Molleti},
  year   = {2026},
  url    = {https://huggingface.co/sharkdodo/Indic-Bert-NER-Model},
  note   = {Fine-tuned from AI4Bharat's Indic-BERT}
}
```
## Related Resources
- Base Model: AI4Bharat Indic-BERT
- Dataset: Indic-Bert-NER-BIO-Dataset
## Changelog

### Version 1.0 (April 2026)
- Initial release
- Fine-tuned on Phase 2 augmented dataset
- Support for Indian languages via Indic-BERT base