FineMed Medical-Entity Extractor (FR)

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

📚 Introduction

This is the medical-entity extractor used to compute the medical-term density axis of FineMed-fr. Given a French medical document, it extracts medical-term spans under an 8-class UMLS-adapted taxonomy (disease, drug, body_part, …). The density is then the ratio of characters inside the extracted spans to the document's total characters.

It is a GLiNER2 model (mDeBERTa-v3 backbone, 512-token context) fine-tuned on LLM annotations. It is one of the three lightweight annotators behind FineMed-fr, which cover subdomain, educational quality, and medical-term density.

🚀 How to Use

from gliner2 import GLiNER2

extractor = GLiNER2.from_pretrained("doctolib-lab/finemed-entity-extractor-fr")

# 8-class taxonomy; passing descriptions (not just the keys) improves extraction
labels = {
    "disease": "Pathological condition: disease, syndrome, infection, cancer, injury, symptom, clinical finding, mental disorder",
    "drug": "Chemical substance for therapy: prescription medication, vaccine, therapeutic compound, drug class, contrast agent",
    "body_part": "Anatomical structure: organ, tissue, bone, muscle, blood vessel, nerve, cell, body fluid, anatomical region",
    "medical_procedure": "Clinical action with methodology: surgery, diagnostic test, medical examination, laboratory test, imaging procedure",
    "molecular_marker": "Molecular entity or biochemical substance: gene, protein, enzyme, receptor, genetic variant, biochemical analyte",
    "clinical_device": "Manufactured medical object: surgical tool, implant, prosthetic, diagnostic scanner, monitoring equipment",
    "vital_function": "Physiological parameter name: heart rate, blood pressure, respiratory rate, temperature, oxygen saturation",
    "living_beings": "Non-human organism in biomedical context: bacterium, virus, fungus, parasite, pathogen, model organism",
}

text = "Le patient présente une pneumonie traitée par amoxicilline ..."
results = extractor.batch_extract_entities([text], labels, threshold=0.5)
print(results[0]["entities"])
# {"disease": ["pneumonie"], "drug": ["amoxicilline"], ...}

To reproduce FineMed's medical_entity_density, run extraction over the middle 512 tokens of each document, then divide the characters covered by the extracted spans by the document's total character count. Taking the middle window skips boilerplate at the document boundaries and keeps corpus-scale inference tractable.
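The recipe above can be sketched in plain Python. This is a minimal illustration, not the FineMed pipeline itself: the helper names (`middle_window`, `find_spans`, `covered_chars`, `entity_density`) are hypothetical, the window helper works on any token sequence (the real pipeline presumably uses the model tokenizer), and spans are recovered by exact string matching of the returned mentions, which is an assumption that may differ from offset-based extraction.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def middle_window(tokens: Sequence, size: int = 512) -> Sequence:
    """Return the centered slice of at most `size` tokens."""
    if len(tokens) <= size:
        return tokens
    start = (len(tokens) - size) // 2
    return tokens[start:start + size]


def find_spans(text: str, mentions: Iterable[str]) -> List[Tuple[int, int]]:
    """Locate every exact occurrence of each mention as (start, end) offsets.

    Assumption: mentions are returned as surface strings, so offsets are
    recovered by string search rather than read from the model output.
    """
    spans = []
    for mention in set(mentions):
        if not mention:
            continue
        start = text.find(mention)
        while start != -1:
            spans.append((start, start + len(mention)))
            start = text.find(mention, start + 1)
    return spans


def covered_chars(spans: Iterable[Tuple[int, int]]) -> int:
    """Count characters in the union of spans, so overlaps are counted once."""
    total, cursor = 0, 0
    for start, end in sorted(spans):
        start = max(start, cursor)
        if end > start:
            total += end - start
            cursor = end
    return total


def entity_density(window_text: str,
                   entities: Dict[str, List[str]],
                   total_chars: int) -> float:
    """Characters covered by extracted mentions, divided by `total_chars`."""
    if total_chars == 0:
        return 0.0
    mentions = [m for group in entities.values() for m in group]
    return covered_chars(find_spans(window_text, mentions)) / total_chars


# Example with the output shape shown in the usage snippet above.
doc = "Le patient présente une pneumonie traitée par amoxicilline."
entities = {"disease": ["pneumonie"], "drug": ["amoxicilline"]}
density = entity_density(doc, entities, total_chars=len(doc))
print(round(density, 3))
```

Here the short document fits inside the window, so the window text and the document are the same string. For longer documents, `window_text` would be the decoded middle window while `total_chars` stays the length of the full document, following the description above.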

🏷️ Entity Taxonomy

8 classes adapted from UMLS, keeping the medical-term-rich groups:

| Class | Covers |
| --- | --- |
| disease | disease, syndrome, infection, cancer, injury, symptom, clinical finding, mental disorder |
| drug | prescription medication, vaccine, therapeutic compound, drug class, contrast agent |
| body_part | organ, tissue, bone, muscle, blood vessel, nerve, cell, body fluid, anatomical region |
| medical_procedure | surgery, diagnostic test, medical examination, laboratory test, imaging procedure |
| molecular_marker | gene, protein, enzyme, receptor, genetic variant, biochemical analyte |
| clinical_device | surgical tool, implant, prosthetic, diagnostic scanner, monitoring equipment |
| vital_function | heart rate, blood pressure, respiratory rate, temperature, oxygen saturation |
| living_beings | bacterium, virus, fungus, parasite, pathogen, model organism |

🔧 Training

Fine-tuned from GLiNER2 on entity annotations produced by Qwen3-235B-A22B-Instruct via a two-pass self-review (Pass 1 extracts entities, Pass 2 reviews and corrects them) over roughly 300k documents. Best configuration: training prompts without per-class descriptions, inference prompts with descriptions. Entity-group order is shuffled during annotation to mitigate position bias. The two annotation prompts are in medical_entity_extract_prompt.txt (Pass 1) and medical_entity_review_prompt.txt (Pass 2).
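Two of the choices above, shuffling the entity-group order and dropping per-class descriptions at training time, can be illustrated with a small sketch. The helpers `shuffled_labels` and `names_only` are hypothetical, and how the actual annotation and training prompts render the groups is not shown here; the sketch only demonstrates the idea on a label dictionary shaped like the one in the usage section.

```python
import random
from typing import Dict, List, Optional


def shuffled_labels(labels: Dict[str, str],
                    rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Return the same label -> description mapping in a random key order.

    Re-drawing the order per document means no class always appears
    first in the prompt, which is one way to reduce position bias.
    """
    rng = rng or random.Random()
    keys = list(labels)
    rng.shuffle(keys)
    return {key: labels[key] for key in keys}


def names_only(labels: Dict[str, str]) -> List[str]:
    """Drop descriptions and keep only class names (training-style input)."""
    return list(labels)


# Abbreviated example taxonomy; the descriptions are shortened for the sketch.
example_labels = {
    "disease": "Pathological condition",
    "drug": "Chemical substance for therapy",
    "body_part": "Anatomical structure",
}

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
per_document_order = shuffled_labels(example_labels, rng)
training_labels = names_only(example_labels)
print(list(per_document_order), training_labels)
```

At inference time, the full dictionary with descriptions would be passed instead of `training_labels`, matching the best configuration reported above.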

⚠️ Intended Use & Limitations

Built to annotate French medical web text at corpus scale (to build FineMed-fr), not for clinical decision-making. It is tuned for density estimation over a 512-token window, not exhaustive document-level entity recognition.

⚖️ License

Apache-2.0, inherited from the GLiNER2 base model.

🏛️ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.
