townboy/kpfbert-kdpii

Korean PII token-classification model fine-tuned from KPF/KPF-bert-ner on a KDPII-style dialogue dataset.

Dataset

Source file: 연대1_PII_dataset_V3.json
Documents: 4981
Sentences: 53778
Positive PII sentences: 19037
Label count: 33
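
The counts above imply a couple of derived figures that are useful when judging class balance. The following is a minimal sketch of that arithmetic using only the numbers listed in this card; the variable names are illustrative, and it assumes "Positive PII sentences" means sentences containing at least one PII span.

```python
# Counts copied from the Dataset section of this card.
documents = 4981
sentences = 53778
positive_pii_sentences = 19037

# Share of sentences assumed to contain at least one PII span.
positive_ratio = positive_pii_sentences / sentences

# Average number of sentences per document.
sentences_per_document = sentences / documents

print(f"Positive PII sentence ratio: {positive_ratio:.1%}")
print(f"Sentences per document: {sentences_per_document:.1f}")
```

Roughly 35% of sentences are positive and documents average about 10.8 sentences, so most sentences the model sees carry no PII at all.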

Training Setup

Max length: 128
Epochs: 4.0
Learning rate: 2e-05
Train batch size: 8
Eval batch size: 8
Device: cuda
GPU: NVIDIA GeForce RTX 4060 Ti
Mixed precision: auto
Gradient checkpointing: True
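
For readers who want to reproduce a similar run, the settings above could map onto Hugging Face `TrainingArguments` roughly as sketched below. This is an assumption about how the values translate, not the author's training script: `training_kwargs`, `MAX_LENGTH`, and `build_training_arguments` are hypothetical names, and the "Mixed precision: auto" setting is left out because the card does not say how it was resolved.

```python
# Hyperparameters copied from the Training Setup list in this card.
training_kwargs = {
    "num_train_epochs": 4.0,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_checkpointing": True,
}

# Max length is applied when tokenizing, not through TrainingArguments,
# e.g. tokenizer(text, truncation=True, max_length=MAX_LENGTH).
MAX_LENGTH = 128


def build_training_arguments(output_dir):
    """Build TrainingArguments from the settings above (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs)
```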

Intended Use

This model detects personally identifiable information (PII) in Korean dialogue-like text. Typical labels include names, nicknames, account numbers, mobile numbers, emails, addresses, IDs, and related sensitive entities.

Quick Inference

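from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub.
pipe = pipeline(
    "token-classification",
    model="townboy/kpfbert-kdpii",
    aggregation_strategy="simple",  # merge word pieces into entity spans
)

# Each result is a dict with entity_group, score, word, start, and end.
print(pipe("Phone 010-8661-5573, ID wanderingrabbit1"))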
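
A common next step is to mask the detected spans. The sketch below relies on the `start`, `end`, `score`, and `entity_group` keys that the token-classification pipeline returns with an aggregation strategy. The `redact` helper, the `min_score` threshold, and the `PHONE` and `ID` label names are hypothetical; check the checkpoint's `id2label` mapping for the real label names.

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with its label, e.g. "[PHONE]"."""
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["score"] < min_score:
            continue  # keep low-confidence spans unmasked
        label = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + label + text[ent["end"]:]
    return text


text = "Phone 010-8661-5573, ID wanderingrabbit1"

# Hand-written stand-in for pipe(text); label names and scores are made up.
phone, user_id = "010-8661-5573", "wanderingrabbit1"
entities = [
    {"entity_group": "PHONE", "score": 0.99, "word": phone,
     "start": text.index(phone), "end": text.index(phone) + len(phone)},
    {"entity_group": "ID", "score": 0.98, "word": user_id,
     "start": text.index(user_id), "end": text.index(user_id) + len(user_id)},
]

print(redact(text, entities))
```

In real use, pass the output of `pipe(text)` as `entities` instead of the hand-written list.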

Notes

The classification head was reinitialized to match the KDPII label space (33 labels).
Validate this checkpoint on your own target traffic before using it in production.
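
One way to run that validation is to compare predicted spans against a small hand-labeled sample of your own traffic. The sketch below computes exact-match precision, recall, and F1 over `(start, end, label)` tuples. The `span_prf` helper and the toy spans are hypothetical, and exact matching is only one possible scoring choice.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the second prediction misses the last character of the ID,
# so under exact matching it counts as both a miss and a false alarm.
gold = [(6, 19, "PHONE"), (24, 40, "ID")]
predicted = [(6, 19, "PHONE"), (24, 39, "ID")]

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict about span boundaries. A more lenient overlap-based score may suit masking use cases where partial coverage still has value.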

Safetensors

Model size: 0.1B params
Tensor type: F32
Base model: KPF/KPF-bert-ner