FineMed Medical-Entity Extractor (FR)

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

📚 Introduction

This is the medical-entity extractor used to compute the medical-term density axis of FineMed-fr. Given a French medical document, it extracts medical-term spans under an 8-class UMLS-adapted taxonomy (disease, drug, body_part, …). The density is then the ratio of characters inside the extracted spans to the document's total characters.

It is a GLiNER2 model (mDeBERTa-v3 backbone, 512-token context) fine-tuned on LLM annotations. It is one of the three lightweight annotators behind FineMed-fr, one per axis: subdomain, educational quality, and medical-term density.

🚀 How to Use

from gliner2 import GLiNER2

extractor = GLiNER2.from_pretrained("doctolib-lab/finemed-entity-extractor-fr")

# 8-class taxonomy; passing descriptions (not just the keys) improves extraction
labels = {
    "disease": "Pathological condition: disease, syndrome, infection, cancer, injury, symptom, clinical finding, mental disorder",
    "drug": "Chemical substance for therapy: prescription medication, vaccine, therapeutic compound, drug class, contrast agent",
    "body_part": "Anatomical structure: organ, tissue, bone, muscle, blood vessel, nerve, cell, body fluid, anatomical region",
    "medical_procedure": "Clinical action with methodology: surgery, diagnostic test, medical examination, laboratory test, imaging procedure",
    "molecular_marker": "Molecular entity or biochemical substance: gene, protein, enzyme, receptor, genetic variant, biochemical analyte",
    "clinical_device": "Manufactured medical object: surgical tool, implant, prosthetic, diagnostic scanner, monitoring equipment",
    "vital_function": "Physiological parameter name: heart rate, blood pressure, respiratory rate, temperature, oxygen saturation",
    "living_beings": "Non-human organism in biomedical context: bacterium, virus, fungus, parasite, pathogen, model organism",
}

text = "Le patient présente une pneumonie traitée par amoxicilline ..."
results = extractor.batch_extract_entities([text], labels, threshold=0.5)
print(results[0]["entities"])
# {"disease": ["pneumonie"], "drug": ["amoxicilline"], ...}

To reproduce FineMed's medical_entity_density, run extraction over the middle 512 tokens of each document, then divide the characters covered by the extracted spans by the document's total character count. Taking the middle window skips boilerplate at the document boundaries and keeps corpus-scale inference tractable.
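The density computation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the FineMed pipeline itself: `middle_window`, `covered_chars`, and `entity_density` are hypothetical helper names, the tokens passed to `middle_window` are assumed to come from whatever tokenizer you use, and spans are approximated by matching the extracted surface strings back into the text, since the output shown above contains strings rather than character offsets.

```python
import re


def middle_window(tokens, size=512):
    """Return the centred slice of at most `size` tokens (hypothetical helper)."""
    if len(tokens) <= size:
        return list(tokens)
    start = (len(tokens) - size) // 2
    return list(tokens[start:start + size])


def covered_chars(text, entities):
    """Count characters of `text` covered by at least one extracted surface form.

    Assumption: every occurrence of a surface string counts as a span.
    A boolean mask keeps overlapping or repeated spans from being counted twice.
    """
    covered = [False] * len(text)
    for surfaces in entities.values():
        for surface in surfaces:
            if not surface:
                continue  # an empty pattern would match at every position
            for match in re.finditer(re.escape(surface), text):
                for i in range(match.start(), match.end()):
                    covered[i] = True
    return sum(covered)


def entity_density(text, entities, total_chars=None):
    """Covered characters divided by a total character count.

    `total_chars` defaults to len(text); pass the full document length when
    `text` is only the middle window of a longer document.
    """
    total = len(text) if total_chars is None else total_chars
    if total == 0:
        return 0.0
    return covered_chars(text, entities) / total


text = "Le patient présente une pneumonie traitée par amoxicilline"
entities = {"disease": ["pneumonie"], "drug": ["amoxicilline"]}  # shape of results[0]["entities"]
print(round(entity_density(text, entities), 3))
```

If your extraction call returns character offsets, marking those offsets directly in the mask is more faithful than re-matching strings, because string matching also counts occurrences the model did not tag.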

🏷️ Entity Taxonomy

The taxonomy has 8 classes adapted from UMLS, keeping the medical-term-rich groups:

class	covers
`disease`	disease, syndrome, infection, cancer, injury, symptom, clinical finding, mental disorder
`drug`	prescription medication, vaccine, therapeutic compound, drug class, contrast agent
`body_part`	organ, tissue, bone, muscle, blood vessel, nerve, cell, body fluid, anatomical region
`medical_procedure`	surgery, diagnostic test, medical examination, laboratory test, imaging procedure
`molecular_marker`	gene, protein, enzyme, receptor, genetic variant, biochemical analyte
`clinical_device`	surgical tool, implant, prosthetic, diagnostic scanner, monitoring equipment
`vital_function`	heart rate, blood pressure, respiratory rate, temperature, oxygen saturation
`living_beings`	bacterium, virus, fungus, parasite, pathogen, model organism

🔧 Training

The model is fine-tuned from GLiNER2 on entity annotations produced by Qwen3-235B-A22B-Instruct over roughly 300k documents, using a two-pass self-review: Pass 1 extracts entities, Pass 2 reviews and corrects them. The best configuration uses training prompts without per-class descriptions and inference prompts with descriptions. Entity-group order is shuffled during annotation to mitigate position bias. The two annotation prompts are in medical_entity_extract_prompt.txt (Pass 1) and medical_entity_review_prompt.txt (Pass 2).
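The order shuffling mentioned above could look like the following. This is a minimal sketch, not the annotation code used for FineMed-fr: `shuffled_groups` and `render_group_block` are hypothetical names, the bullet layout of the rendered block is an assumption, and seeding per document is one possible way to keep runs reproducible.

```python
import random

ENTITY_GROUPS = [
    "disease", "drug", "body_part", "medical_procedure",
    "molecular_marker", "clinical_device", "vital_function", "living_beings",
]


def shuffled_groups(groups, seed):
    """Return a shuffled copy of `groups`; the same seed gives the same order."""
    rng = random.Random(seed)
    order = list(groups)  # copy so the canonical order is left untouched
    rng.shuffle(order)
    return order


def render_group_block(groups, seed):
    """Render the entity groups as one bullet per line, in shuffled order."""
    return "\n".join(f"- {group}" for group in shuffled_groups(groups, seed))


# One seed per document (here a hypothetical document index) varies the order
# across the corpus while keeping each prompt reproducible.
print(render_group_block(ENTITY_GROUPS, seed=42))
```

Because each document sees the classes in a different order, no class is systematically favoured for appearing first or last in the prompt.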

⚠️ Intended Use & Limitations

Built to annotate French medical web text at corpus scale (to build FineMed-fr), not for clinical decision-making. It is tuned for density estimation over a 512-token window, not exhaustive document-level entity recognition.