Indic-Bert-NER-Model

A fine-tuned Named Entity Recognition (NER) model based on ai4bharat/indic-bert for extracting medical and regulatory entities from Indian language documents.

Model Details

Overview

This model is fine-tuned for NER tasks on medical and regulatory documents, specifically for identifying entities in adverse event reports and regulatory submissions. It extends the multilingual Indic-BERT base model with specialized training on pharmaceutical and medical regulatory terminology.

Model Architecture

  • Base Model: ai4bharat/indic-bert
  • Task: Token Classification (Named Entity Recognition)
  • Languages Supported: Indian languages (Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, and others)
  • Framework: PyTorch / Transformers

Model Specifications

  • Model Type: BERT for token classification
  • Tokenizer: SentencePiece
  • Max Sequence Length: 512 tokens
  • Hidden Size: 768
  • Number of Attention Heads: 12
  • Number of Layers: 12

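Because the model's context window is capped at 512 tokens, longer documents must be split before inference. A minimal sketch of one common approach, an overlapping sliding window over the token ids (the window size and stride values here are illustrative choices, not settings prescribed by this model):

```python
def chunk_token_ids(ids, max_len=512, stride=128):
    """Split a long token-id sequence into overlapping windows.

    Overlap (stride) lets entities that straddle a chunk boundary
    appear whole in at least one window.
    """
    if len(ids) <= max_len:
        return [ids]
    chunks, start = [], 0
    step = max_len - stride  # advance by window size minus overlap
    while start < len(ids):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break
        start += step
    return chunks

# A 1000-token document becomes three overlapping 512-token windows
windows = chunk_token_ids(list(range(1000)), max_len=512, stride=128)
```

In practice the Transformers tokenizer can produce such windows directly via its `truncation`, `stride`, and `return_overflowing_tokens` arguments; the sketch above just makes the arithmetic explicit.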
Training Data

The model was trained on the Indic-Bert-NER-BIO-Dataset, which includes:

  • Annotated medical and pharmaceutical regulatory documents
  • Multiple data sources: CTRI, FAERS, JSL datasets
  • Phase 2 augmented and merged datasets for improved robustness
  • BIO (Begin-Inside-Outside) tagged entities

For detailed dataset information, see: Indic-Bert-NER-BIO-Dataset
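Under the BIO scheme, `B-` marks the first token of an entity, `I-` marks continuation tokens, and `O` marks non-entity tokens. A small sketch of how BIO-tagged tokens decode into entity spans (the tag names `B-DRUG`, `B-CONDITION` are illustrative, not the model's actual label set):

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)        # continue the open entity
        else:                               # "O" or a stray "I-" tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

tokens = ["Take", "acetyl", "salicylic", "acid", "for", "fever"]
tags = ["O", "B-DRUG", "I-DRUG", "I-DRUG", "O", "B-CONDITION"]
print(bio_to_entities(tokens, tags))
# → [('DRUG', 'acetyl salicylic acid'), ('CONDITION', 'fever')]
```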

Supported Entity Tags

The model recognizes the following entity categories:

  • Medical Entities: Drug names, diseases, medical conditions
  • Regulatory Entities: Dosages, routes of administration, adverse events
  • Document Entities: Document types, regulatory references

Complete entity taxonomy available in the dataset repository.

Usage

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load model and tokenizer
model_name = "sharkdodo/Indic-Bert-NER-Model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Example text
text = "This drug is made from paracetamol and is used for headache treatment."

# Perform NER
results = ner_pipeline(text)
print(results)
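The `aggregation_strategy="simple"` option makes the pipeline merge per-subword predictions back into whole words. A rough pure-Python sketch of that merging step, using WordPiece-style `##` continuation markers for illustration (Indic-BERT actually uses a SentencePiece tokenizer, whose continuation convention differs):

```python
def merge_subwords(token_preds):
    """Merge '##'-prefixed continuation pieces into whole words,
    keeping the label of the first piece -- a simplified sketch of
    what aggregation_strategy='simple' does inside the pipeline."""
    merged = []
    for token, label in token_preds:
        if token.startswith("##") and merged:
            prev_token, prev_label = merged[-1]
            merged[-1] = (prev_token + token[2:], prev_label)
        else:
            merged.append((token, label))
    return merged

pieces = [("para", "B-DRUG"), ("##ceta", "I-DRUG"),
          ("##mol", "I-DRUG"), ("helps", "O")]
print(merge_subwords(pieces))
# → [('paracetamol', 'B-DRUG'), ('helps', 'O')]
```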

Advanced Usage: Raw Model Outputs

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "sharkdodo/Indic-Bert-NER-Model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "The drug dosage is 500 milligrams daily."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Run inference without tracking gradients
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Map prediction ids to label names
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, pred in zip(tokens, predictions[0].tolist()):
    print(f"{token}: {id2label[pred]}")

Batch Processing

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="sharkdodo/Indic-Bert-NER-Model",
    aggregation_strategy="simple"
)

texts = [
    "Paracetamol is commonly used to treat headaches and fever.",
    "Take Ibuprofen 400 milligrams tablet for pain relief."
]

results = ner(texts)  # the pipeline accepts a list of texts directly
for text, entities in zip(texts, results):
    print(f"Text: {text}")
    print(f"Entities: {entities}\n")
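With `aggregation_strategy="simple"`, each entity the pipeline returns is a dict carrying (among other keys) `entity_group`, `start`, and `end` character offsets into the input text. A small sketch of pulling the surface strings back out of those offsets (the entity dicts below are illustrative stand-ins, not real model predictions):

```python
def extract_entities(text, entities):
    """Recover (entity_group, surface_text) pairs from pipeline-style
    entity dicts using their character start/end offsets."""
    return [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]

text = "Take Ibuprofen 400 milligrams tablet for pain relief."
fake_entities = [  # illustrative output shape, not real predictions
    {"entity_group": "DRUG", "start": 5, "end": 14},
    {"entity_group": "DOSAGE", "start": 15, "end": 29},
]
print(extract_entities(text, fake_entities))
# → [('DRUG', 'Ibuprofen'), ('DOSAGE', '400 milligrams')]
```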

Model Card

Model Use

Intended Use: Named Entity Recognition for medical and regulatory documents in Indian languages.

Primary Users:

  • Healthcare professionals
  • Regulatory compliance teams
  • Medical document processors
  • Adverse event monitoring systems

Limitations

  • Trained primarily on Hindi and on English-transliterated Indian-language text
  • Performance may vary across regional language variations
  • Performs best on well-formatted documents
  • Trained on a specific pharmaceutical and regulatory domain; accuracy may degrade outside it

Ethical Considerations

  • Use only for legitimate regulatory and medical purposes
  • Ensure data privacy compliance when processing sensitive health information
  • Do not use for automated decision-making in clinical settings without human review
  • Respect patient confidentiality and comply with applicable regulations such as HIPAA and the DPDP Act

License

This model is released under the MIT License.

MIT License

Copyright (c) 2026 Vivek Molleti

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Citation

If you use this model in your research or application, please cite:

@misc{indic_bert_ner_2026,
  title = {Indic-Bert-NER-Model},
  author = {Vivek Molleti},
  year = {2026},
  url = {https://huggingface.co/sharkdodo/Indic-Bert-NER-Model},
  note = {Fine-tuned from AI4Bharat's Indic-BERT}
}

Changelog

Version 1.0 (April 2026)

  • Initial release
  • Fine-tuned on Phase 2 augmented dataset
  • Support for Indian languages via Indic-BERT base