mmBERT Multilingual PII NER
A fine-tuned jhu-clsp/mmBERT-base model with a CRF layer for Personally Identifiable Information (PII) detection in multilingual dialogues across 11 languages.
Model Description
This model performs token-level Named Entity Recognition (NER) to identify and classify PII entities in dialogue text. It was trained on synthetic multilingual conversational data created for de-identification.
- Architecture: mmBERT-base (ModernBERT) + CRF head
- Training: Fine-tuned on all 11 languages jointly (multilingual training)
- Loss: Cross-Entropy
- Hyperparameters: lr=2e-05, batch_size=32, max_length=512, dropout=0.1, epochs=10
- Decoding: Viterbi decoding via CRF layer
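To illustrate what the Viterbi step adds over picking the best tag per token, here is a minimal pure-Python sketch of linear-chain decoding. It is not the pytorch-crf implementation; the function name `viterbi`, the three-tag inventory, and all scores are made up for illustration.

```python
NEG = -1e4  # large penalty standing in for a "forbidden" transition (illustrative)


def viterbi(emissions, transitions, start, end):
    """Return the highest-scoring tag path.

    emissions[t][j]  : score of tag j at position t
    transitions[i][j]: score of moving from tag i to tag j
    start[j], end[j] : score of starting / ending in tag j
    """
    num_tags = len(start)
    # score[j] = best score of any path that ends in tag j at the current position
    score = [start[j] + emissions[0][j] for j in range(num_tags)]
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Pick the best final tag, then follow the back-pointers to recover the path
    last = max(range(num_tags), key=lambda j: score[j] + end[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical tags: 0 = O, 1 = B-PERSON, 2 = I-PERSON
emissions = [[2.0, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.0]]
transitions = [[0.0, 0.0, NEG], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # O -> I penalized
start = [0.0, 0.0, NEG]  # a sequence should not start with I
end = [0.0, 0.0, 0.0]

greedy = [max(range(3), key=lambda j: step[j]) for step in emissions]
best = viterbi(emissions, transitions, start, end)
print(greedy)  # [0, 2, 0] -> O, I-PERSON, O (invalid BIO sequence)
print(best)    # [0, 1, 0] -> O, B-PERSON, O
```

In this toy case the per-token argmax yields an I- tag directly after O, while the decoded path respects the transition penalty.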
Supported Languages
| Code | Language |
|---|---|
| AR | Arabic |
| DE | German |
| EN | English |
| FI | Finnish |
| FR | French |
| HI | Hindi |
| IT | Italian |
| PL | Polish |
| PT | Portuguese |
| SP | Spanish |
| TR | Turkish |
Entity Types
The model recognizes 19 PII entity types using BIO tagging:
| Entity | Description |
|---|---|
| PERSON | Person names |
| PERSON_EMAIL | Email addresses |
| PERSON_SOCIAL_RELATION | Social relations (e.g., "my wife") |
| ORG | Organizations |
| LOC_CITY | Cities |
| LOC_COUNTRY | Countries |
| LOC_STREET | Street names |
| LOC_ZIP | ZIP/postal codes |
| LOC_HOUSENUMBER | House numbers |
| LOC_OTHER | Other locations |
| DATETIME | Dates and times |
| DATETIME_AGE | Ages |
| CODE | ID numbers, reference codes |
| CODE_PHONE | Phone numbers |
| CODE_URL | URLs |
| PROFESSION | Professions |
| PRODUCT | Product names |
| QUANTITY | Quantities |
| MISC | Miscellaneous PII |
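Because the output is a BIO tag per token, downstream code usually needs to group tags into entity spans. The following is a minimal sketch, assuming label strings of the form `B-PERSON` / `I-PERSON` / `O` (the exact strings in `crf_config.json` are not shown in this card). The helper name `bio_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a new span is a design choice, not something the card prescribes.

```python
def bio_to_spans(labels):
    """Group BIO labels into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")
        # A token continues the open span only if it is I- with the same type
        inside = prefix == "I" and etype == current
        if not inside:
            if current is not None:
                spans.append((current, start, i))
            if prefix in ("B", "I"):
                # B- opens a span; a stray I- is treated as an opening tag too
                start, current = i, etype
            else:
                start, current = None, None
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


# Hand-written labels for illustration (not model output)
labels = ["O", "O", "O", "B-PERSON", "I-PERSON", "O", "O", "O", "O", "B-LOC_CITY", "O"]
print(bio_to_spans(labels))  # [('PERSON', 3, 5), ('LOC_CITY', 9, 10)]
```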
Performance
Evaluated on held-out test sets per language (type-aware micro scores):
| Language | Lenient F1 | Lenient F2 | Exact F1 | Exact F2 |
|---|---|---|---|---|
| AR | 80.76 | 76.66 | 76.99 | 73.08 |
| DE | 91.66 | 90.71 | 90.54 | 89.60 |
| EN | 93.68 | 92.70 | 91.66 | 90.70 |
| FI | 87.70 | 86.77 | 85.65 | 84.73 |
| FR | 87.26 | 85.89 | 83.68 | 82.36 |
| HI | 84.94 | 82.91 | 81.26 | 79.31 |
| IT | 90.03 | 88.19 | 87.14 | 85.35 |
| PL | 89.33 | 89.45 | 86.17 | 86.29 |
| PT | 90.30 | 89.15 | 88.81 | 87.68 |
| SP | 91.39 | 90.76 | 89.62 | 89.00 |
| TR | 85.53 | 84.72 | 82.06 | 81.27 |
| AVG | 88.42 | 87.08 | 85.78 | 84.49 |
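The card does not spell out the matching rules behind these columns, so the sketch below is only one plausible reading: "exact" is assumed to require the same entity type and identical boundaries, and "lenient" the same type with any token overlap. F-beta is the standard formula, with F2 weighting recall more heavily than precision. The names `matches` and `span_f_beta` are hypothetical, and the spans are invented for illustration.

```python
def matches(a, b, lenient):
    """Spans are (entity_type, start, end) with exclusive end."""
    if a[0] != b[0]:  # type-aware: the entity types must agree
        return False
    if lenient:
        return a[1] < b[2] and b[1] < a[2]  # any overlap
    return a[1:] == b[1:]  # identical boundaries


def span_f_beta(gold, pred, beta=1.0, lenient=False):
    # Precision counts predicted spans that match some gold span;
    # recall counts gold spans that are matched by some predicted span.
    hit_pred = sum(any(matches(p, g, lenient) for g in gold) for p in pred)
    hit_gold = sum(any(matches(g, p, lenient) for p in pred) for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented spans: one boundary error on PERSON and one spurious ORG
gold = [("PERSON", 3, 5), ("LOC_CITY", 9, 10)]
pred = [("PERSON", 3, 4), ("LOC_CITY", 9, 10), ("ORG", 0, 1)]

exact_f1 = span_f_beta(gold, pred, beta=1.0, lenient=False)    # P = 1/3, R = 1/2
lenient_f1 = span_f_beta(gold, pred, beta=1.0, lenient=True)   # P = 2/3, R = 1
lenient_f2 = span_f_beta(gold, pred, beta=2.0, lenient=True)
print(round(exact_f1, 3), round(lenient_f1, 3), round(lenient_f2, 3))
```

Micro scores would pool the hit counts over all sentences before computing precision and recall; the sketch scores a single sentence for brevity.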
Usage
This model uses a custom CRF architecture and cannot be loaded directly with AutoModelForTokenClassification. You need to use the custom ModernBertCRF class.
Setup
import json

import torch
import torch.nn as nn
from huggingface_hub import snapshot_download  # installed as a transformers dependency
from torchcrf import CRF  # provided by the pytorch-crf package
from transformers import AutoModel, AutoTokenizer


class ModernBertCRF(nn.Module):
    def __init__(self, base_model_name, num_labels, id2label, label2id):
        super().__init__()
        self.num_labels = num_labels
        self.id2label = id2label
        self.label2id = label2id
        self.transformer = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.transformer.config.hidden_size
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        # ModernBERT does not use token_type_ids; drop them if the tokenizer returns them
        kwargs.pop("token_type_ids", None)
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs.last_hidden_state)
        emissions = self.classifier(sequence_output)
        if labels is not None:
            mask = attention_mask.bool()
            # The CRF cannot index label -100, so map ignored positions to tag 0
            labels_for_crf = labels.clone()
            labels_for_crf[labels_for_crf == -100] = 0
            # The CRF returns a log-likelihood; negate it to obtain a loss
            loss = -self.crf(emissions, labels_for_crf, mask=mask, reduction='mean')
            return {"loss": loss, "logits": emissions}
        else:
            return {"logits": emissions}

    def decode(self, emissions, mask):
        return self.crf.decode(emissions, mask=mask)


# Load model
# open() and torch.load() need local paths, so fetch the repository files first
model_dir = snapshot_download("deryaerman/mmbert_multilingual_pii_ner")
with open(f"{model_dir}/crf_config.json") as f:
    config = json.load(f)

model = ModernBertCRF(
    base_model_name=config["base_model_name"],
    num_labels=config["num_labels"],
    id2label=config["id2label"],
    label2id=config["label2id"],
)
model.load_state_dict(torch.load(f"{model_dir}/pytorch_model.bin", map_location="cpu"))
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_dir)
# JSON stores keys as strings; convert them to the integer ids returned by decode()
id2label = {int(k): v for k, v in config["id2label"].items()}
Preprocessing: Sentence Splitting
The model was trained on sentence-level input: each training example is a single sentence, split and tokenized using spaCy. For best results, split your input into sentences before inference. Passing unsplit speaker turns (multiple sentences as one input) can cause entities to be missed.
import re

import spacy

nlp = spacy.blank("en")  # use "de" for German, "xx" for multilingual
nlp.add_pipe("sentencizer")


def split_dialogue(text):
    """
    Split raw dialogue text into a list of sentences (each a list of tokens).
    Expects lines like: 'SPEAKER_00: Hello, my name is Peter.'
    """
    sentences = []
    for line in text.strip().splitlines():
        # Strip the speaker prefix, if present
        m = re.match(r"^(SPEAKER_\d+)\s*:\s*(.*)", line.strip())
        if m:
            line = m.group(2)
        if not line:
            continue
        doc = nlp(line)
        for sent in doc.sents:
            tokens = [tok.text for tok in sent if not tok.is_space]
            if tokens:
                sentences.append(tokens)
    return sentences


# Example
raw = """SPEAKER_00: Hello, my name is Peter.
SPEAKER_01: Hello, my name is Peter as well. Okay, and where do you come from? I come from Chicago."""
dialogue = split_dialogue(raw)
# [['Hello', ',', 'my', 'name', 'is', 'Peter', '.'],
#  ['Hello', ',', 'my', 'name', 'is', 'Peter', 'as', 'well', '.'],
#  ['Okay', ',', 'and', 'where', 'do', 'you', 'come', 'from', '?'],
#  ['I', 'come', 'from', 'Chicago', '.']]
Inference
def predict_sentences(sentences, model, tokenizer, id2label, device="cpu"):
    """
    sentences: list of sentences, each a list of word tokens
    Returns: list of label lists, one per sentence
    """
    all_labels = []
    for tokens in sentences:
        enc = tokenizer(tokens, is_split_into_words=True,
                        return_tensors="pt", truncation=True, max_length=512).to(device)
        word_ids = enc.word_ids(batch_index=0)
        with torch.no_grad():
            outputs = model(**enc)
        emissions = outputs["logits"]
        mask = enc["attention_mask"].bool()
        preds = model.decode(emissions, mask)[0]
        # Label each word with the prediction for its first sub-token
        word_labels = ["O"] * len(tokens)
        seen = set()
        for idx, wid in enumerate(word_ids):
            if wid is None or wid in seen:
                continue
            seen.add(wid)
            word_labels[wid] = id2label[preds[idx]]
        all_labels.append(word_labels)
    return all_labels


# Example: dialogue from above
results = predict_sentences(dialogue, model, tokenizer, id2label)
for sent_tokens, sent_labels in zip(dialogue, results):
    for token, label in zip(sent_tokens, sent_labels):
        if label != "O":
            print(f"{token:20s} -> {label}")
Single-sentence inference
If you only have isolated sentences, you can pass them directly:
tokens = ["My", "name", "is", "John", "Smith", "and", "I", "live", "in", "Berlin", "."]
enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=512)
word_ids = enc.word_ids(batch_index=0)
with torch.no_grad():
    outputs = model(**enc)
emissions = outputs["logits"]
mask = enc["attention_mask"].bool()
preds = model.decode(emissions, mask)[0]

# Print the label of each word's first sub-token
seen = set()
for idx, wid in enumerate(word_ids):
    if wid is None or wid in seen:
        continue
    seen.add(wid)
    label = id2label[preds[idx]]
    if label != "O":
        print(f"{tokens[wid]:20s} -> {label}")
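Once each word has a label, de-identification is a matter of replacing tagged spans. The following is a minimal sketch of one way to do that, assuming `B-`/`I-` prefixed label strings. The helper name `redact` and the `[TYPE]` placeholder format are hypothetical choices, and the labels below are hand-written for illustration rather than actual model output.

```python
def redact(tokens, labels):
    """Replace each tagged entity span with a single [TYPE] placeholder."""
    out = []
    prev_type = None
    for token, label in zip(tokens, labels):
        if label == "O":
            out.append(token)
            prev_type = None
            continue
        prefix, _, etype = label.partition("-")
        # Emit one placeholder per span: on B-, or on an I- that
        # does not continue a span of the same type
        if prefix == "B" or etype != prev_type:
            out.append(f"[{etype}]")
        prev_type = etype
    return out


tokens = ["My", "name", "is", "John", "Smith", "and", "I", "live", "in", "Berlin", "."]
# Hand-written labels for illustration (not model output)
labels = ["O", "O", "O", "B-PERSON", "I-PERSON", "O", "O", "O", "O", "B-LOC_CITY", "O"]
print(" ".join(redact(tokens, labels)))
# My name is [PERSON] and I live in [LOC_CITY] .
```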
Training Data
The model was trained on synthetic multilingual dialogue data covering various domains (medical anamnesis, customer support, police reports, therapy sessions, etc.). The data was generated and annotated as part of a thesis project on multilingual PII de-identification.
Limitations
- Trained on synthetic dialogue data; performance on real-world data may vary
- Optimized for dialogue/conversational text; may underperform on formal documents
- Arabic and Hindi show lower performance compared to European languages
- Requires the pytorch-crf package for inference
Citation
If you use this model, please cite:
@mastersthesis{erman2026multilingual,
title={Multilingual De-Identification of Dialogue Data using Transformer-based NER},
author={Erman, Derya},
year={2026}
}