mmBERT Multilingual PII NER

A fine-tuned jhu-clsp/mmBERT-base model with a CRF layer for Personally Identifiable Information (PII) detection in multilingual dialogues across 11 languages.

Model Description

This model performs token-level Named Entity Recognition (NER) to identify and classify PII entities in dialogue text. It was trained on synthetic multilingual conversational data created for de-identification.

Architecture: mmBERT-base (ModernBERT) + CRF head
Training: Fine-tuned on all 11 languages jointly (multilingual training)
Loss: Cross-Entropy
Hyperparameters: lr=2e-05, batch_size=32, max_length=512, dropout=0.1, epochs=10
Decoding: Viterbi decoding via CRF layer
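
As a rough illustration of what Viterbi decoding does with the classifier's emission scores, the sketch below finds the best tag path for one sentence in plain Python. It is not the pytorch-crf implementation, which runs batched on tensors. The function name viterbi_decode, the three-tag inventory, and all scores are made up for the example.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
    start: Sequence[float],
    end: Sequence[float],
) -> List[int]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][j]   : score of tag j at position t (the classifier output)
    transitions[i][j] : score of moving from tag i to tag j
    start[j], end[j]  : score of starting / ending in tag j
    """
    num_tags = len(start)
    # score[j] = best score of any path that ends in tag j at the current position
    score = [start[j] + emissions[0][j] for j in range(num_tags)]
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag i for landing in tag j at position t
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Pick the best final tag, then follow the backpointers to the start
    last = max(range(num_tags), key=lambda j: score[j] + end[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical tags: 0 = O, 1 = B-PERSON, 2 = I-PERSON
emissions = [[1.0, 0.8, 0.0], [0.0, 0.0, 2.0]]
transitions = [[0.0, 0.0, -10.0],  # O -> I-PERSON is heavily penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
start = [0.0, 0.0, -10.0]          # a sentence should not start with I-PERSON
end = [0.0, 0.0, 0.0]

print(viterbi_decode(emissions, transitions, start, end))  # [1, 2]
```

Here a per-token argmax would pick O followed by I-PERSON, which is not a valid BIO sequence. The transition penalty steers the decoder to B-PERSON, I-PERSON instead.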

Supported Languages

Code	Language
AR	Arabic
DE	German
EN	English
FI	Finnish
FR	French
HI	Hindi
IT	Italian
PL	Polish
PT	Portuguese
SP	Spanish
TR	Turkish

Entity Types

The model recognizes 19 PII entity types using BIO tagging:

Entity	Description
`PERSON`	Person names
`PERSON_EMAIL`	Email addresses
`PERSON_SOCIAL_RELATION`	Social relations (e.g., "my wife")
`ORG`	Organizations
`LOC_CITY`	Cities
`LOC_COUNTRY`	Countries
`LOC_STREET`	Street names
`LOC_ZIP`	ZIP/postal codes
`LOC_HOUSENUMBER`	House numbers
`LOC_OTHER`	Other locations
`DATETIME`	Dates and times
`DATETIME_AGE`	Ages
`CODE`	ID numbers, reference codes
`CODE_PHONE`	Phone numbers
`CODE_URL`	URLs
`PROFESSION`	Professions
`PRODUCT`	Product names
`QUANTITY`	Quantities
`MISC`	Miscellaneous PII
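
With BIO tagging, the first word of an entity is labeled B-TYPE, following words of the same entity I-TYPE, and everything else O. A minimal sketch of grouping such word-level labels into entity spans is shown below. The helper name bio_to_spans is hypothetical, the labels are written by hand rather than produced by the model, and treating a stray I- tag as a span start is an assumption.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group word-level BIO labels into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current_type, start = None, 0
    for i, label in enumerate(labels + ["O"]):  # sentinel "O" closes a trailing span
        prefix, _, ent_type = label.partition("-")
        continues = prefix == "I" and ent_type == current_type
        if current_type is not None and not continues:
            spans.append((current_type, start, i))
            current_type = None
        if prefix == "B" or (prefix == "I" and current_type is None):
            # Assumption: a stray I- tag without a preceding B- starts a new span
            current_type, start = ent_type, i
    return spans


tokens = ["My", "name", "is", "John", "Smith", "and", "I", "live", "in", "Berlin", "."]
# Hand-written labels for illustration, not actual model output
labels = ["O", "O", "O", "B-PERSON", "I-PERSON", "O", "O", "O", "O", "B-LOC_CITY", "O"]

for ent_type, begin, stop in bio_to_spans(labels):
    print(ent_type, " ".join(tokens[begin:stop]))
# PERSON John Smith
# LOC_CITY Berlin
```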

Performance

Evaluated on a held-out test set for each language. Scores are type-aware and micro-averaged, in percent:

Language	Lenient F1	Lenient F2	Exact F1	Exact F2
AR	80.76	76.66	76.99	73.08
DE	91.66	90.71	90.54	89.60
EN	93.68	92.70	91.66	90.70
FI	87.70	86.77	85.65	84.73
FR	87.26	85.89	83.68	82.36
HI	84.94	82.91	81.26	79.31
IT	90.03	88.19	87.14	85.35
PL	89.33	89.45	86.17	86.29
PT	90.30	89.15	88.81	87.68
SP	91.39	90.76	89.62	89.00
TR	85.53	84.72	82.06	81.27
AVG	88.42	87.08	85.78	84.49
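
For readers unfamiliar with F2: it is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below shows the standard formula. The precision and recall values are hypothetical and are not taken from this model's evaluation.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical precision/recall values, chosen only to illustrate the formula
p, r = 0.90, 0.80
print(round(f_beta(p, r, beta=1), 4))  # 0.8471
print(round(f_beta(p, r, beta=2), 4))  # 0.8182
```

Because F2 favors recall, an F2 below F1 (as in most rows of the table) means recall is lower than precision for that language, and an F2 above F1 (as for PL) means the opposite.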

Usage

This model uses a custom CRF architecture and cannot be loaded directly with AutoModelForTokenClassification. Use the custom ModernBertCRF class defined below instead; it depends on the pytorch-crf package, which is imported as torchcrf.

Setup

import torch
import json
from transformers import AutoModel, AutoTokenizer
from torchcrf import CRF
import torch.nn as nn

class ModernBertCRF(nn.Module):
    def __init__(self, base_model_name, num_labels, id2label, label2id):
        super().__init__()
        self.num_labels = num_labels
        self.id2label = id2label
        self.label2id = label2id
        self.transformer = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.transformer.config.hidden_size
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        kwargs.pop("token_type_ids", None)
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs.last_hidden_state)
        emissions = self.classifier(sequence_output)
        if labels is not None:
            mask = attention_mask.bool()
            labels_for_crf = labels.clone()
            labels_for_crf[labels_for_crf == -100] = 0
            loss = -self.crf(emissions, labels_for_crf, mask=mask, reduction='mean')
            return {"loss": loss, "logits": emissions}
        else:
            return {"logits": emissions}

    def decode(self, emissions, mask):
        return self.crf.decode(emissions, mask=mask)

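# Load model
# open() and torch.load() need local file paths, not a Hub repo id, so download
# the repository files to the local cache first. If you already cloned the
# repository, set model_dir to that local directory instead.
from huggingface_hub import snapshot_download

model_dir = snapshot_download("deryaerman/mmbert_multilingual_pii_ner")

with open(f"{model_dir}/crf_config.json") as f:
    config = json.load(f)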

model = ModernBertCRF(
    base_model_name=config["base_model_name"],
    num_labels=config["num_labels"],
    id2label=config["id2label"],
    label2id=config["label2id"],
)
model.load_state_dict(torch.load(f"{model_dir}/pytorch_model.bin", map_location="cpu"))
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_dir)
id2label = {int(k): v for k, v in config["id2label"].items()}

Preprocessing: Sentence Splitting

The model was trained on sentence-level input: each training example is a single sentence, split and tokenized with spaCy. For best results, split your input into sentences before inference. Passing an unsplit speaker turn (several sentences as one input) can cause entities to be missed.

import re
import spacy

nlp = spacy.blank("en")          # use "de" for German, "xx" for multilingual
nlp.add_pipe("sentencizer")

def split_dialogue(text):
    """
    Split raw dialogue text into a list of sentences (each a list of tokens).
    Expects lines like: 'SPEAKER_00: Hello, my name is Peter.'
    """
    sentences = []
    for line in text.strip().splitlines():
        m = re.match(r"^(SPEAKER_\d+)\s*:\s*(.*)", line.strip())
        if m:
            line = m.group(2)
        if not line:
            continue
        doc = nlp(line)
        for sent in doc.sents:
            tokens = [tok.text for tok in sent if not tok.is_space]
            if tokens:
                sentences.append(tokens)
    return sentences

# Example
raw = """SPEAKER_00: Hello, my name is Peter.
SPEAKER_01: Hello, my name is Peter as well. Okay, and where do you come from? I come from Chicago."""

dialogue = split_dialogue(raw)
# [['Hello', ',', 'my', 'name', 'is', 'Peter', '.'],
#  ['Hello', ',', 'my', 'name', 'is', 'Peter', 'as', 'well', '.'],
#  ['Okay', ',', 'and', 'where', 'do', 'you', 'come', 'from', '?'],
#  ['I', 'come', 'from', 'Chicago', '.']]

Inference

def predict_sentences(sentences, model, tokenizer, id2label, device="cpu"):
    """
    sentences: list of sentences, each a list of word tokens
    Returns:   list of label lists, one per sentence
    """
    model.to(device)  # keep the model on the same device as the inputs
    all_labels = []
    for tokens in sentences:
        enc = tokenizer(tokens, is_split_into_words=True,
                        return_tensors="pt", truncation=True, max_length=512).to(device)
        word_ids = enc.word_ids(batch_index=0)

        with torch.no_grad():
            outputs = model(**enc)
            emissions = outputs["logits"]
            mask = enc["attention_mask"].bool()
            preds = model.decode(emissions, mask)[0]

        word_labels = ["O"] * len(tokens)
        seen = set()
        for idx, wid in enumerate(word_ids):
            if wid is None or wid in seen:
                continue
            seen.add(wid)
            word_labels[wid] = id2label[preds[idx]]

        all_labels.append(word_labels)

    return all_labels


# Example: dialogue from above
results = predict_sentences(dialogue, model, tokenizer, id2label)

for sent_tokens, sent_labels in zip(dialogue, results):
    for token, label in zip(sent_tokens, sent_labels):
        if label != "O":
            print(f"{token:20s} -> {label}")
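
Since the goal is de-identification, the predicted labels can also be used to mask the text. The sketch below replaces each entity with one placeholder. The helper name redact and the [TYPE] placeholder format are assumptions for illustration, and the labels are hand-written rather than actual model output. In practice you would pass each pair from dialogue and results.

```python
from typing import List


def redact(tokens: List[str], labels: List[str]) -> str:
    """Replace each labeled entity with a single [TYPE] placeholder."""
    out = []
    previous_type = None
    for token, label in zip(tokens, labels):
        if label == "O":
            out.append(token)
            previous_type = None
            continue
        prefix, _, ent_type = label.partition("-")
        # Emit one placeholder per entity: skip I- tokens that continue the same type
        if not (prefix == "I" and ent_type == previous_type):
            out.append(f"[{ent_type}]")
        previous_type = ent_type
    return " ".join(out)


# Hand-written labels for illustration, not actual model output
tokens = ["My", "name", "is", "John", "Smith", "."]
labels = ["O", "O", "O", "B-PERSON", "I-PERSON", "O"]
print(redact(tokens, labels))  # My name is [PERSON] .
```

Tokens are joined with single spaces, so the original spacing around punctuation is not restored.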

Single-sentence inference

If you only have isolated sentences, you can pass them directly:

tokens = ["My", "name", "is", "John", "Smith", "and", "I", "live", "in", "Berlin", "."]

enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=512)
word_ids = enc.word_ids(batch_index=0)

with torch.no_grad():
    outputs = model(**enc)
    emissions = outputs["logits"]
    mask = enc["attention_mask"].bool()
    preds = model.decode(emissions, mask)[0]

seen = set()
for idx, wid in enumerate(word_ids):
    if wid is None or wid in seen:
        continue
    seen.add(wid)
    label = id2label[preds[idx]]
    if label != "O":
        print(f"{tokens[wid]:20s} -> {label}")

Training Data

The model was trained on synthetic multilingual dialogue data covering various domains (medical anamnesis, customer support, police reports, therapy sessions, etc.). The data was generated and annotated as part of a thesis project on multilingual PII de-identification.

Limitations

Trained on synthetic dialogue data; performance on real-world data may vary
Optimized for dialogue/conversational text; may underperform on formal documents
Arabic and Hindi score lowest (Lenient F1 of 80.76 and 84.94, against an 11-language average of 88.42)
Requires the pytorch-crf package (imported as torchcrf) for inference

Citation

If you use this model, please cite:

@mastersthesis{erman2026multilingual,
  title={Multilingual De-Identification of Dialogue Data using Transformer-based NER},
  author={Erman, Derya},
  year={2026}
}
