metadata
license: mit
base_model: distilbert-base-uncased
tags:
  - token-classification
  - pii
  - privacy
  - personal-information
  - bert
  - distilbert
language:
  - en
pipeline_tag: token-classification
library_name: transformers
datasets:
  - ai4privacy/pii-masking-200k
metrics:
  - f1
  - precision
  - recall
widget:
  - text: Hi, my name is John Smith and my email is john.smith@example.com
    example_title: Example with PII

BERT PII Detection Model

Fine-tuned DistilBERT model for detecting and classifying Personally Identifiable Information (PII).

Model Details

  • Base Model: distilbert-base-uncased
  • Task: Token Classification (Named Entity Recognition)
  • Languages: English
  • License: MIT
  • Fine-tuned on: AI4Privacy PII-200k dataset (ai4privacy/pii-masking-200k)

Supported PII Entity Types

This model detects 56 types of PII entities, including:

Personal Information:

  • FIRSTNAME, LASTNAME, MIDDLENAME
  • EMAIL, PHONENUMBER, USERNAME
  • DATE, TIME, DOB, AGE

Address Information:

  • STREET, CITY, STATE, COUNTY
  • ZIPCODE, BUILDINGNUMBER
  • SECONDARYADDRESS

Financial Information:

  • CREDITCARDNUMBER, CREDITCARDISSUER, CREDITCARDCVV
  • ACCOUNTNAME, ACCOUNTNUMBER, IBAN, BIC
  • AMOUNT, CURRENCY, CURRENCYCODE, CURRENCYSYMBOL

Identification:

  • SSN, PIN, PASSWORD
  • IP, IPV4, IPV6, MAC
  • ETHEREUMADDRESS, BITCOINADDRESS, LITECOINADDRESS

Professional Information:

  • JOBTITLE, JOBTYPE, JOBAREA, COMPANYNAME
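
The list above is a selection; the complete label set can be read from the model configuration instead. The following is a minimal sketch, assuming the checkpoint exposes its labels through `config.id2label` and tags tokens with BIO-style prefixes (`B-`, `I-`). The helper name `entity_types` and the small mapping are illustrative stand-ins, not the model's actual label table.

```python
def entity_types(id2label):
    """Collapse BIO-style tags (e.g. B-EMAIL, I-EMAIL) into sorted entity names."""
    names = set()
    for label in id2label.values():
        if label == "O":  # "outside" tag, not an entity
            continue
        # Drop a leading "B-" or "I-" prefix if present (assumed tagging scheme)
        prefix, sep, rest = label.partition("-")
        names.add(rest if sep and prefix in {"B", "I"} else label)
    return sorted(names)


# Hypothetical stand-in for model.config.id2label
sample_id2label = {
    0: "O",
    1: "B-FIRSTNAME",
    2: "I-FIRSTNAME",
    3: "B-EMAIL",
    4: "I-EMAIL",
}
print(entity_types(sample_id2label))
```

With the real model, pass `model.config.id2label` in place of `sample_id2label` after loading the model as shown under Usage.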

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load model and tokenizer
model_name = "SoelMgd/bert-pii-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create NER pipeline
ner_pipeline = pipeline(
    "ner", 
    model=model, 
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Example usage
text = "Hi, my name is John Smith and my email is john.smith@example.com"
entities = ner_pipeline(text)
print(entities)

Training Data

  • Dataset: AI4Privacy PII-200k
  • Size: ~209k examples
  • Languages: English, French, German, Italian (this model: English only)
  • Entity Types: 56 different PII categories

Performance

The model is reported to reach high precision and recall across entity types. This card does not publish per-entity scores, so measure F1, precision, and recall on your own data before relying on it.
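
Since this card lists F1, precision, and recall as its metrics, here is a minimal sketch of how they could be computed at the entity level on your own labelled data. It assumes exact-match scoring over `(label, start, end)` spans, which is one common convention and not necessarily the one used to evaluate this model. The function name `span_prf` and the sample spans are illustrative and are not results from this model.

```python
def span_prf(gold, pred):
    """Exact-match precision, recall, and F1 over (label, start, end) spans."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # spans that match in label and both offsets
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Made-up annotations and predictions for a single text
gold = [("FIRSTNAME", 15, 19), ("LASTNAME", 20, 25), ("EMAIL", 42, 64)]
pred = [("FIRSTNAME", 15, 19), ("EMAIL", 42, 64), ("CITY", 70, 76)]

precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Pipeline output can be converted to this shape with `(e["entity_group"], e["start"], e["end"])` for each returned entity.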

Intended Use

This model is designed for:

  • PII detection and masking in text
  • Privacy compliance applications
  • Data anonymization pipelines
  • Content moderation systems
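
For the masking use case, the pipeline output can be turned into redacted text using the `start` and `end` character offsets it returns. The following is a minimal sketch, assuming the `aggregation_strategy="simple"` output shape (dicts with `entity_group`, `score`, `start`, `end`). The helper `mask_entities`, its `min_score` parameter, and the sample entities are illustrative; the entities are hand-written, not real model output.

```python
def mask_entities(text, entities, min_score=0.0):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["score"] < min_score:
            continue  # skip low-confidence detections
        placeholder = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written entities in the pipeline's output shape (not real model output)
text = "Hi, my name is John Smith."
entities = [
    {"entity_group": "FIRSTNAME", "score": 0.99, "start": 15, "end": 19},
    {"entity_group": "LASTNAME", "score": 0.98, "start": 20, "end": 25},
]
print(mask_entities(text, entities))
```

In practice, `entities` would be the list returned by `ner_pipeline(text)`. A higher `min_score` masks less text and lets more PII through, so choose the threshold according to which error is costlier in your application.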

Limitations

  • Trained on English text only; other languages are not supported
  • May not generalize to domain-specific jargon
  • Performance may vary on very short or very long texts
  • Should be validated on your specific use case
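
One way to handle very long texts is to split them into overlapping windows, run the pipeline on each window, and shift the returned offsets back into the full text. DistilBERT accepts at most 512 tokens per input. The sketch below uses a character-based window as a rough proxy for that limit. The helper names and the default sizes are assumptions for illustration, not part of this model or of the transformers API.

```python
def split_with_offsets(text, max_chars=1000, overlap=100):
    """Return (offset, chunk) pairs covering text with overlapping windows."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append((start, text[start:end]))
        if end == len(text):
            break
        start = end - overlap  # step back so windows share some context
    return chunks


def shift_entities(entities, offset):
    """Move chunk-relative start/end offsets back into the full text."""
    return [
        {**ent, "start": ent["start"] + offset, "end": ent["end"] + offset}
        for ent in entities
    ]


# Small demonstration with tiny windows
for offset, chunk in split_with_offsets("a" * 25, max_chars=10, overlap=2):
    print(offset, len(chunk))
```

Each chunk would be passed to `ner_pipeline`, and its results to `shift_entities` with the matching offset. Entities detected twice in the overlap region need deduplication, and a window edge can still cut through an entity, so check results near the boundaries.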