# privacy-filter-nemotron

Fine-tuned `openai/privacy-filter` for fine-grained PII extraction across 55 categories from `nvidia/Nemotron-PII`.
- **Base model:** `openai/privacy-filter`, a 1.4B-parameter MoE (50M active per token) with a BIOES token-classification head
- **Task:** token classification for PII detection (BIOES scheme)
- **Training data:** full 100K rows of the `nvidia/Nemotron-PII` train split
- **Held-out val:** 10K label-stratified rows from the Nemotron test split (every label has ≥229 entities)
- **Recipe:** `opf train` (OpenAI's official fine-tuning CLI), full fine-tune, AdamW, lr=1e-4, 5 epochs, bf16, weight decay 0.0
- **Labels:** 55 fine-grained PII categories → 221 BIOES classes (1 O + 55 × B/I/E/S), sketched below
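Concretely, the 221-class figure is just the BIOES arithmetic. A minimal sketch (the three example categories and their ordering are illustrative; the only layout assumption carried into the examples below is that O sits at index 0):

```python
# 55 categories -> 221 BIOES classes: one "O" plus B/I/E/S per category.
categories = ["first_name", "last_name", "ssn"]  # ...plus 52 more in the real model
labels = ["O"] + [f"{prefix}-{cat}" for cat in categories for prefix in "BIES"]
print(len(labels))  # 1 + 3*4 = 13 here; with all 55 categories: 1 + 55*4 = 221
```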
The base model ships with 8 coarse PII categories (`private_person`, `private_email`, etc.). This model trades that coarse vocabulary for a roughly 7× more granular one (8 → 55 categories): `first_name`, `last_name`, `medical_record_number`, `credit_debit_card`, `ssn`, and so on, matching what downstream redaction and masking pipelines typically need.
**Family at a glance.** Same architecture, three runtimes:

- **PyTorch (this repo):** CPU + CUDA, anywhere transformers runs.
- **MLX BF16:** `OpenMed/privacy-filter-nemotron-mlx`, Apple Silicon, full precision.
- **MLX 8-bit:** `OpenMed/privacy-filter-nemotron-mlx-8bit`, Apple Silicon, ~1.7× faster.
## Quick start

### With OpenMed (recommended)

OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi decoding, span refinement, and a Faker-backed obfuscation engine. The same call works on every host: Apple Silicon picks up MLX automatically; everywhere else uses this PyTorch checkpoint.

```bash
pip install -U "openmed[hf]"
```
```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans
result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

# De-identify with any of the supported methods
masked = deidentify(text, method="mask", model_name="OpenMed/privacy-filter-nemotron")
removed = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-nemotron")
hashed = deidentify(text, method="hash", model_name="OpenMed/privacy-filter-nemotron")

# Faker-backed locale-aware obfuscation; deterministic with consistent=True + a fixed seed
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-nemotron",
    consistent=True,
    seed=42,
)
print(fake.deidentified_text)
```
`OpenMed/privacy-filter-nemotron-mlx*` model names also work in the same `extract_pii()` / `deidentify()` calls; on a non-Apple-Silicon host they automatically fall back to this PyTorch checkpoint with a one-time warning, so you can ship MLX names in code and still run on Linux/Windows.
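For example, assuming the fallback behavior just described, this snippet runs unchanged on any host:

```python
from openmed import extract_pii

text = "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910."
# On Apple Silicon this loads the MLX weights; on other hosts it falls back
# to this PyTorch checkpoint after a one-time warning.
result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx")
```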
The OpenMed wrapper passes `trust_remote_code=True` for you, runs the model's own BIOES Viterbi decoder, and skips OpenMed's regex smart-merging (the model already produces clean spans).
### With opf (OpenAI's official CLI)

```bash
pip install 'opf @ git+https://github.com/openai/privacy-filter.git'

opf redact \
  --checkpoint OpenMed/privacy-filter-nemotron \
  --text "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
```
### With transformers directly

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "OpenMed/privacy-filter-nemotron"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, trust_remote_code=True, dtype=torch.bfloat16
).to("cuda")
model.eval()

text = "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
enc = tok(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**enc).logits.argmax(-1).cpu()[0].tolist()

id2label = {int(k): v for k, v in model.config.id2label.items()}
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].cpu().tolist())
for t, l in zip(tokens, out):
    if l != 0:  # class 0 is the O (outside) tag
        print(f"{t}\t{id2label[l]}")
```
For best results use Viterbi decoding rather than argmax; both opf and OpenMed do this by default. If you stick with argmax on the raw transformers API, you'll see slightly more boundary errors but still excellent label accuracy.
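If you want Viterbi-quality boundaries while staying on the raw transformers API, the core idea is to forbid invalid BIOES transitions (e.g. `B-x` followed by anything other than `I-x`/`E-x`) when decoding per-token log-probabilities. Below is a minimal NumPy sketch of that idea; the helper names are mine, and opf's shipped decoder (plus its span refinement) remains the authoritative implementation:

```python
import numpy as np

def allowed(prev: str, nxt: str) -> bool:
    """BIOES constraints: B-x/I-x must continue with I-x/E-x of the same
    category; O, E-x, and S-x may be followed by O, B-y, or S-y."""
    if prev[0] in "BI":
        return nxt[0] in "IE" and nxt[2:] == prev[2:]
    return nxt[0] in "OBS"

def bioes_viterbi(log_probs, id2label):
    """Decode (seq_len, num_classes) log-probabilities under BIOES constraints."""
    T, C = log_probs.shape
    labels = [id2label[i] for i in range(C)]
    trans = np.full((C, C), -np.inf)  # 0 = legal transition, -inf = illegal
    for i, p in enumerate(labels):
        for j, n in enumerate(labels):
            if allowed(p, n):
                trans[i, j] = 0.0
    score = log_probs[0].copy()
    score[[j for j, l in enumerate(labels) if l[0] in "IE"]] = -np.inf  # legal starts only
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans          # (prev_class, next_class)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_probs[t]
    score[[j for j, l in enumerate(labels) if l[0] in "BI"]] = -np.inf  # legal ends only
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [labels[i] for i in reversed(path)]
```

With the encoding from the example above, `log_probs = torch.log_softmax(model(**enc).logits, -1)[0].cpu().numpy()` feeds straight into `bioes_viterbi(log_probs, id2label)`.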
## Performance

Evaluated with `opf eval --decode-mode viterbi --eval-mode typed --span-metrics-space char` on the 10K label-stratified held-out val split from `nvidia/Nemotron-PII:test`.
### Headline

| Metric | Value |
|---|---|
| Macro B-F1 (across 55 labels) | 0.9533 |
| Token accuracy | 0.9910 |
| Strong labels (F1 ≥ 0.90) | 46 / 55 |
| Acceptable (F1 0.70–0.89) | 7 / 55 |
| Weak (F1 < 0.70) | 0 / 55 |
### Per-label F1 (B-tag, sorted)

| Label | Precision | Recall | F1 |
|---|---|---|---|
| 🟢 mac_address | 1.000 | 1.000 | 1.000 |
| 🟢 biometric_identifier | 0.999 | 0.998 | 0.999 |
| 🟢 bank_routing_number | 0.995 | 0.999 | 0.997 |
| 🟢 credit_debit_card | 0.999 | 0.993 | 0.996 |
| 🟢 ipv6 | 0.992 | 1.000 | 0.996 |
| 🟢 health_plan_beneficiary_number | 1.000 | 0.990 | 0.995 |
| 🟢 coordinate | 0.994 | 0.996 | 0.995 |
| 🟢 ipv4 | 0.993 | 0.996 | 0.994 |
| 🟢 url | 0.989 | 0.999 | 0.994 |
| 🟢 email | 0.994 | 0.993 | 0.994 |
| 🟢 date_of_birth | 0.992 | 0.994 | 0.993 |
| 🟢 medical_record_number | 0.997 | 0.989 | 0.993 |
| 🟢 street_address | 0.996 | 0.989 | 0.993 |
| 🟢 vehicle_identifier | 0.986 | 0.996 | 0.991 |
| 🟢 license_plate | 0.987 | 0.993 | 0.990 |
| 🟢 customer_id | 0.995 | 0.984 | 0.990 |
| 🟢 http_cookie | 0.992 | 0.983 | 0.988 |
| 🟢 employee_id | 0.987 | 0.988 | 0.988 |
| 🟢 account_number | 0.992 | 0.982 | 0.987 |
| 🟢 certificate_license_number | 0.989 | 0.984 | 0.987 |
| 🟢 swift_bic | 0.975 | 0.998 | 0.987 |
| 🟢 postcode | 0.991 | 0.981 | 0.986 |
| 🟢 api_key | 0.980 | 0.990 | 0.985 |
| 🟢 password | 0.999 | 0.968 | 0.983 |
| 🟢 tax_id | 1.000 | 0.965 | 0.982 |
| 🟢 device_identifier | 0.974 | 0.988 | 0.981 |
| 🟢 national_id | 0.991 | 0.961 | 0.976 |
| 🟢 last_name | 0.977 | 0.975 | 0.976 |
| 🟢 date_time | 0.982 | 0.967 | 0.974 |
| 🟢 first_name | 0.962 | 0.978 | 0.970 |
| 🟢 pin | 0.973 | 0.967 | 0.970 |
| 🟢 phone_number | 0.948 | 0.992 | 0.970 |
| 🟢 county | 0.986 | 0.946 | 0.965 |
| 🟢 employment_status | 0.960 | 0.968 | 0.964 |
| 🟢 user_name | 0.959 | 0.964 | 0.961 |
| 🟢 date | 0.967 | 0.955 | 0.961 |
| 🟢 blood_type | 0.922 | 0.954 | 0.938 |
| 🟢 country | 0.955 | 0.918 | 0.936 |
| 🟢 ssn | 0.926 | 0.945 | 0.935 |
| 🟢 education_level | 0.961 | 0.908 | 0.934 |
| 🟢 sexuality | 0.908 | 0.956 | 0.931 |
| 🟢 company_name | 0.967 | 0.894 | 0.929 |
| 🟢 religious_belief | 0.912 | 0.941 | 0.926 |
| 🟢 unique_id | 0.910 | 0.922 | 0.916 |
| 🟢 political_view | 0.939 | 0.872 | 0.905 |
| 🟢 fax_number | 0.978 | 0.841 | 0.904 |
| 🟡 city | 0.917 | 0.876 | 0.896 |
| 🟡 time | 0.933 | 0.802 | 0.863 |
| 🟡 race_ethnicity | 0.821 | 0.906 | 0.861 |
| 🟡 gender | 0.967 | 0.744 | 0.841 |
| 🟡 state | 0.878 | 0.785 | 0.829 |
| 🟡 language | 0.889 | 0.735 | 0.804 |
| 🟡 occupation | 0.799 | 0.667 | 0.727 |
## Label space (55 categories)

| Category | Labels |
|---|---|
| Identity | first_name, last_name, user_name, age, gender, race_ethnicity, sexuality, religious_belief, political_view, marital_status, nationality, education_level, occupation, employment_status, language, blood_type, biometric_identifier |
| Contact | email, phone_number, fax_number, url |
| Address | street_address, city, county, state, country, postcode, coordinate |
| Dates | date, date_of_birth, date_time, time |
| Government IDs | ssn, national_id, tax_id |
| Financial | account_number, bank_routing_number, swift_bic, credit_debit_card, cvv, pin, password |
| Healthcare | medical_record_number, health_plan_beneficiary_number |
| Enterprise IDs | customer_id, employee_id, unique_id, certificate_license_number |
| Vehicle | license_plate, vehicle_identifier |
| Digital | ipv4, ipv6, mac_address, device_identifier, api_key, http_cookie |
## Training notes

**Head initialization.** opf's default "copy-from-matching-base" head init. Of the 221 new BIOES classes, 5 had exact matches in the base (O and B/I/E/S-account_number); the other 216 were copied from semantically adjacent coarse rows and fine-tuned end-to-end.
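In code, the idea looks roughly like the sketch below. The `ADJACENT` mapping and all function names here are hypothetical (the real coarse→fine table lives inside opf); the sketch only illustrates copying classifier rows where a match exists and randomly initializing nothing that has one:

```python
import torch

# Hypothetical coarse->fine mapping, for illustration only; the real
# adjacency table lives inside opf.
ADJACENT = {
    "first_name": "private_person",
    "last_name": "private_person",
    "email": "private_email",
}

def init_fine_head(coarse_head: torch.Tensor,
                   coarse_ids: dict,      # e.g. {"O": 0, "B-account_number": 1, ...}
                   fine_labels: list) -> torch.Tensor:
    """Seed a fine-grained classifier head from a coarse one."""
    fine = torch.empty(len(fine_labels), coarse_head.shape[1])
    torch.nn.init.normal_(fine, std=0.02)  # fallback for unmatched rows
    for i, tag in enumerate(fine_labels):
        if tag in coarse_ids:              # exact match: O, B/I/E/S-account_number
            fine[i] = coarse_head[coarse_ids[tag]]
        elif tag != "O":
            prefix, cat = tag.split("-", 1)
            adj = ADJACENT.get(cat)        # semantically adjacent coarse category
            if adj and f"{prefix}-{adj}" in coarse_ids:
                fine[i] = coarse_head[coarse_ids[f"{prefix}-{adj}"]]
    return fine
```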
**Router.** The base model has 128 MoE experts per layer with top-4 routing. Routers were kept trainable during the full fine-tune; no expert collapse was observed.
## Limitations & intended use

- **English-only training data.** Nemotron-PII is predominantly English with a 50/50 US/international locale split. Performance on non-English text is not guaranteed.
- **Fuzzy categories.** `occupation`, `language`, `gender`, `state`, `race_ethnicity`, `political_view`, and `education_level` are fuzzier than the strict identifiers: F1 lands in the 0.73–0.93 range vs 0.95+ for formatted identifiers. If your downstream only cares about strict PII, you can ignore low-confidence predictions on these (see the snippet after this list).
- **Synthetic training data.** Nemotron-PII is a synthesized dataset; real clinical notes, legal documents, and web text may show different surface forms. For high-stakes deployments, collect a domain-specific eval set and re-calibrate thresholds.
- **Not a substitute for legal compliance review.** Use alongside a governance layer (human review, deterministic regex pre-filters, etc.).
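If you only need the strict identifiers, one pragmatic pattern is to filter the extracted spans by category and confidence. A minimal sketch using the `extract_pii()` call from the quick start (the 0.80 threshold is an illustrative choice, not a shipped default):

```python
from openmed import extract_pii

FUZZY = {"occupation", "language", "gender", "state",
         "race_ethnicity", "political_view", "education_level"}

text = "Dr. Sarah Johnson, a cardiologist, can be reached at 415-555-0123."
result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron")
kept = [
    ent for ent in result.entities
    # keep every strict identifier; keep fuzzy categories only when confident
    if ent.label not in FUZZY or ent.confidence >= 0.80  # illustrative threshold
]
```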
## Credits & Acknowledgements

This model wouldn't exist without two open-source releases; sincere thanks to both teams:

- **OpenAI** for open-sourcing the Privacy Filter (architecture, modeling code, and the `opf` training/eval CLI). Everything in this repo is a fine-tune on top of that release.
- **NVIDIA** for releasing the Nemotron-PII dataset with its 100K-row train split and 55 fine-grained PII labels.

Additional thanks to the Hugging Face team for the `transformers` / `huggingface_hub` ecosystem this model ships through.
## License

Apache 2.0, same as the base model.
## Citation

If you use this model, please cite this model, the organization behind it (OpenMed), and the upstream base model and dataset:

```bibtex
@misc{openmed_privacy_filter_nemotron_2026,
  author       = {OpenMed},
  title        = {{OpenMed/privacy-filter-nemotron}: fine-grained PII extraction with 55 categories},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/OpenMed/privacy-filter-nemotron}}
}

@misc{openmed_2026,
  author       = {OpenMed},
  title        = {{OpenMed}: open models and resources for healthcare NLP},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/OpenMed}}
}

@misc{openai_privacy_filter_2025,
  author       = {OpenAI},
  title        = {{openai/privacy-filter}},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/openai/privacy-filter}}
}

@misc{nemotron_pii_2025,
  author       = {NVIDIA},
  title        = {{Nemotron-PII}},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/nvidia/Nemotron-PII}}
}
```