metadata
license: mit
base_model: distilbert-base-uncased
tags:
  - token-classification
  - pii
  - privacy
  - personal-information
  - bert
  - distilbert
language:
  - en
pipeline_tag: token-classification
library_name: transformers
datasets:
  - ai4privacy/pii-masking-200k
metrics:
  - f1
  - precision
  - recall
widget:
  - text: Hi, my name is John Smith and my email is [email protected]
    example_title: Example with PII
BERT PII Detection Model
Fine-tuned DistilBERT model for detecting and classifying Personally Identifiable Information (PII).
Model Details
- Base Model: distilbert-base-uncased
- Task: Token Classification (Named Entity Recognition)
- Languages: English
- License: MIT
- Fine-tuned on: AI4Privacy PII-200k dataset (ai4privacy/pii-masking-200k)
Supported PII Entity Types
This model can detect 56 types of PII entities, including:
Personal Information:
- FIRSTNAME, LASTNAME, MIDDLENAME
- EMAIL, PHONENUMBER, USERNAME
- DATE, TIME, DOB, AGE
Address Information:
- STREET, CITY, STATE, COUNTY
- ZIPCODE, BUILDINGNUMBER
- SECONDARYADDRESS
Financial Information:
- CREDITCARDNUMBER, CREDITCARDISSUER, CREDITCARDCVV
- ACCOUNTNAME, ACCOUNTNUMBER, IBAN, BIC
- AMOUNT, CURRENCY, CURRENCYCODE, CURRENCYSYMBOL
Identification:
- SSN, PIN, PASSWORD
- IP, IPV4, IPV6, MAC
- ETHEREUMADDRESS, BITCOINADDRESS, LITECOINADDRESS
Professional Information:
- JOBTITLE, JOBTYPE, JOBAREA, COMPANYNAME
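For downstream filtering it can help to map a predicted label back to the category it is listed under above. The following is a minimal sketch, assuming the labels the model emits match the names in these lists exactly. The `PII_CATEGORIES` dictionary and the `category_of` helper are illustrative names, the dictionary covers only the types listed in this card (not all 56), and the handling of `B-`/`I-` prefixes is an assumption about the label scheme.

```python
# Illustrative lookup built only from the entity types listed in this card.
PII_CATEGORIES = {
    "Personal Information": [
        "FIRSTNAME", "LASTNAME", "MIDDLENAME", "EMAIL", "PHONENUMBER",
        "USERNAME", "DATE", "TIME", "DOB", "AGE",
    ],
    "Address Information": [
        "STREET", "CITY", "STATE", "COUNTY", "ZIPCODE",
        "BUILDINGNUMBER", "SECONDARYADDRESS",
    ],
    "Financial Information": [
        "CREDITCARDNUMBER", "CREDITCARDISSUER", "CREDITCARDCVV",
        "ACCOUNTNAME", "ACCOUNTNUMBER", "IBAN", "BIC",
        "AMOUNT", "CURRENCY", "CURRENCYCODE", "CURRENCYSYMBOL",
    ],
    "Identification": [
        "SSN", "PIN", "PASSWORD", "IP", "IPV4", "IPV6", "MAC",
        "ETHEREUMADDRESS", "BITCOINADDRESS", "LITECOINADDRESS",
    ],
    "Professional Information": [
        "JOBTITLE", "JOBTYPE", "JOBAREA", "COMPANYNAME",
    ],
}

# Invert the mapping once so lookups are a single dictionary access.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in PII_CATEGORIES.items()
    for label in labels
}


def category_of(label: str) -> str:
    """Return the category for a label, or "Other" if it is not listed here."""
    # Assumption: raw token labels may carry a BIO prefix such as "B-" or "I-".
    if label[:2] in ("B-", "I-"):
        label = label[2:]
    return LABEL_TO_CATEGORY.get(label, "Other")


print(category_of("EMAIL"))
```

Labels that fall outside the listed types resolve to "Other" rather than raising, so the helper stays usable with the remaining entity types.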
Usage
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
# Load model and tokenizer
model_name = "SoelMgd/bert-pii-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
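# Create NER pipeline
# aggregation_strategy="simple" groups adjacent tokens with the same label
# into one entity span instead of returning one result per token
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)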
# Example usage
text = "Hi, my name is John Smith and my email is [email protected]"
entities = ner_pipeline(text)
print(entities)
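The pipeline output can be turned into masked text. The following is a minimal sketch, assuming each result is a dictionary with `start`, `end`, and `entity_group` keys, which is the shape the transformers pipeline returns when an aggregation strategy is set. The `mask_entities` helper and the sample entities are illustrative and hand-written, not real model output.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    masked = text
    # Replace from the end of the string backwards so that earlier
    # character offsets remain valid after each substitution.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


# Hand-written entities in the aggregated output shape (assumed, not model output).
sample_text = "Hi, my name is John Smith"
sample_entities = [
    {"entity_group": "FIRSTNAME", "start": 15, "end": 19},
    {"entity_group": "LASTNAME", "start": 20, "end": 25},
]
print(mask_entities(sample_text, sample_entities))
# prints: Hi, my name is [FIRSTNAME] [LASTNAME]
```

With the real pipeline, `mask_entities(text, ner_pipeline(text))` would apply the same replacement to whatever spans the model predicts.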
Training Data
- Dataset: AI4Privacy PII-200k
- Size: ~209k examples
- Languages: English, French, German, Italian (this model: English only)
- Entity Types: 56 different PII categories
Performance
The model achieves good precision and recall across entity types on PII detection. Exact F1, precision, and recall figures are not listed in this card.
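The metrics named in this card (F1, precision, recall) can be computed at the entity level on your own labeled data. The following is a minimal sketch, assuming exact-match scoring in which a prediction counts only when its start, end, and label all match a gold span. The card does not describe how the model was evaluated, so this is not necessarily the procedure behind the statement above; `span_prf` and the sample spans are illustrative.

```python
from typing import Set, Tuple

# A span is (start offset, end offset, label).
Span = Tuple[int, int, str]


def span_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hand-written example: the second span has the right offsets but the wrong label.
gold = {(15, 19, "FIRSTNAME"), (20, 25, "LASTNAME")}
pred = {(15, 19, "FIRSTNAME"), (20, 25, "FIRSTNAME")}
print(span_prf(gold, pred))
# prints: (0.5, 0.5, 0.5)
```

Exact matching is strict: a span that is off by one character scores as both a false positive and a false negative, so partial-overlap scoring may be a better fit for some masking use cases.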
Intended Use
This model is designed for:
- PII detection and masking in text
- Privacy compliance applications
- Data anonymization pipelines
- Content moderation systems
Limitations
- Trained on English text only; other languages in the dataset are not covered
- May not generalize to domain-specific jargon
- Performance may vary on very short or very long texts
- Should be validated on your specific use case
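One way to handle very long texts is to split them into overlapping windows, run detection on each window, and shift the offsets back to the full text. DistilBERT accepts at most 512 tokens per input, so longer inputs need some strategy of this kind. The following is a minimal sketch, assuming a character-based window as a rough proxy for the token limit. The `detect_in_chunks` helper, its `chunk_chars` and `overlap` defaults, and the `detect` callable are illustrative; `detect` stands in for a function such as the NER pipeline that returns dictionaries with `start`, `end`, and `entity_group`.

```python
from typing import Callable, Dict, List


def detect_in_chunks(
    text: str,
    detect: Callable[[str], List[Dict]],
    chunk_chars: int = 1000,  # assumed window size, not tuned for this model
    overlap: int = 100,       # assumed overlap between consecutive windows
) -> List[Dict]:
    """Run `detect` on overlapping windows and return spans in full-text offsets."""
    if overlap >= chunk_chars:
        raise ValueError("overlap must be smaller than chunk_chars")

    step = chunk_chars - overlap
    seen = set()
    results: List[Dict] = []

    for offset in range(0, max(len(text), 1), step):
        chunk = text[offset: offset + chunk_chars]
        for ent in detect(chunk):
            start = ent["start"] + offset
            end = ent["end"] + offset
            key = (start, end, ent["entity_group"])
            # Spans inside the overlap region can be found twice; keep one copy.
            if key not in seen:
                seen.add(key)
                results.append({**ent, "start": start, "end": end})
        if offset + chunk_chars >= len(text):
            break  # the current window already reaches the end of the text

    return sorted(results, key=lambda e: e["start"])
```

The overlap reduces the chance that an entity is cut in half at a window boundary, but it does not eliminate it, and a character window does not guarantee a token count under the limit, so both parameters should be checked against your own data.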