Upload folder using huggingface_hub

Files changed (8)

README.md +138 -0
config.json +141 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +56 -0
training_args.bin +3 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,138 @@

+---
+license: mit
+base_model: distilbert-base-uncased
+tags:
+- token-classification
+- pii
+- privacy
+- personal-information
+- bert
+- distilbert
+language:
+- en
+pipeline_tag: token-classification
+library_name: transformers
+datasets:
+- ai4privacy/pii-masking-200k
+metrics:
+- f1
+- precision
+- recall
+widget:
+- text: "Hi, my name is John Smith and my email is [email protected]"
+  example_title: "Example with PII"
+---
+# BERT PII Detection Model
+Fine-tuned DistilBERT model for Personally Identifiable Information (PII) detection and classification.
+## Model Details
+- **Base Model**: `distilbert-base-uncased`
+- **Task**: Token Classification (Named Entity Recognition)
+- **Languages**: English
+- **License**: MIT
+- **Fine-tuned on**: AI4Privacy PII-200k dataset
+## Supported PII Entity Types
+This model can detect 56 different types of PII entities including:
+**Personal Information:**
+- FIRSTNAME, LASTNAME, MIDDLENAME
+- EMAIL, PHONENUMBER, USERNAME
+- DATE, TIME, DOB, AGE
+**Address Information:**
+- STREET, CITY, STATE, COUNTY
+- ZIPCODE, BUILDINGNUMBER
+- SECONDARYADDRESS
+**Financial Information:**
+- CREDITCARDNUMBER, CREDITCARDISSUER, CREDITCARDCVV
+- ACCOUNTNAME, ACCOUNTNUMBER, IBAN, BIC
+- AMOUNT, CURRENCY, CURRENCYCODE, CURRENCYSYMBOL
+**Identification:**
+- SSN, PIN, PASSWORD
+- IP, IPV4, IPV6, MAC
+- ETHEREUMADDRESS, BITCOINADDRESS, LITECOINADDRESS
+**Professional Information:**
+- JOBTITLE, JOBTYPE, JOBAREA, COMPANYNAME
+**And many more...**
+## Usage
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+from transformers import pipeline
+# Load model and tokenizer
+model_name = "SoelMgd/bert-pii-detection"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForTokenClassification.from_pretrained(model_name)
+# Create NER pipeline
+ner_pipeline = pipeline(
+    "ner",
+    model=model,
+    tokenizer=tokenizer,
+    aggregation_strategy="simple"
+)
+# Example usage
+text = "Hi, my name is John Smith and my email is [email protected]"
+entities = ner_pipeline(text)
+print(entities)
+```
+## Training Data
+- **Dataset**: AI4Privacy PII-200k
+- **Size**: ~209k examples
+- **Languages**: English, French, German, Italian (this model: English only)
+- **Entity Types**: 56 different PII categories
+## Performance
+This card reports no quantitative metrics. The author describes precision and recall as good across entity types, but no per-entity F1, precision, or recall figures are given.
+## Intended Use
+This model is designed for:
+- PII detection and masking in text
+- Privacy compliance applications
+- Data anonymization pipelines
+- Content moderation systems
+## Limitations
+- Trained primarily on English text
+- May not generalize to domain-specific jargon
+- Performance may vary on very short or very long texts
+- Should be validated on your specific use case
+## Ethical Considerations
+This model is intended to help protect privacy by identifying PII. Users should:
+- Test thoroughly on their specific data
+- Implement appropriate safeguards
+- Consider the legal requirements in their jurisdiction
+- Be aware that no automated system is 100% accurate
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{bert-pii-detection,
+  title={BERT PII Detection Model},
+  author={SoelMgd},
+  year={2025},
+  publisher={Hugging Face},
+  url={https://huggingface.co/SoelMgd/bert-pii-detection}
+}
+```
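
The README lists "PII detection and masking in text" as an intended use but only shows detection. The following is a minimal sketch of the masking step, assuming the entity dictionaries have the shape a `transformers` NER pipeline returns with `aggregation_strategy="simple"` (keys `entity_group`, `score`, `word`, `start`, `end`). The helper name `mask_pii`, the `min_score` parameter, and the hand-written sample entities are hypothetical and are not actual output from this model.

```python
def mask_pii(text, entities, min_score=0.0):
    """Replace each detected span in `text` with a [LABEL] placeholder."""
    masked = text
    # Work right-to-left so earlier character offsets stay valid
    # after each replacement changes the string length.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["score"] < min_score:
            continue  # keep low-confidence spans unmasked
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


# Hand-written entities for illustration only (not real model output).
sample_text = "Hi, my name is John Smith"
sample_entities = [
    {"entity_group": "FIRSTNAME", "score": 0.99, "word": "john", "start": 15, "end": 19},
    {"entity_group": "LASTNAME", "score": 0.98, "word": "smith", "start": 20, "end": 25},
]

print(mask_pii(sample_text, sample_entities))
# prints: Hi, my name is [FIRSTNAME] [LASTNAME]
```

In practice the second argument would be the list returned by `ner_pipeline(text)` from the Usage section. A nonzero `min_score` trades recall for precision, which may not be the right trade for a privacy application.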

config.json ADDED

	@@ -0,0 +1,141 @@

+{
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForTokenClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "ACCOUNTNAME",
+    "1": "ACCOUNTNUMBER",
+    "2": "AGE",
+    "3": "AMOUNT",
+    "4": "BIC",
+    "5": "BITCOINADDRESS",
+    "6": "BUILDINGNUMBER",
+    "7": "CITY",
+    "8": "COMPANYNAME",
+    "9": "COUNTY",
+    "10": "CREDITCARDCVV",
+    "11": "CREDITCARDISSUER",
+    "12": "CREDITCARDNUMBER",
+    "13": "CURRENCY",
+    "14": "CURRENCYCODE",
+    "15": "CURRENCYNAME",
+    "16": "CURRENCYSYMBOL",
+    "17": "DATE",
+    "18": "DOB",
+    "19": "EMAIL",
+    "20": "ETHEREUMADDRESS",
+    "21": "EYECOLOR",
+    "22": "FIRSTNAME",
+    "23": "GENDER",
+    "24": "HEIGHT",
+    "25": "IBAN",
+    "26": "IP",
+    "27": "IPV4",
+    "28": "IPV6",
+    "29": "JOBAREA",
+    "30": "JOBTITLE",
+    "31": "JOBTYPE",
+    "32": "LASTNAME",
+    "33": "LITECOINADDRESS",
+    "34": "MAC",
+    "35": "MASKEDNUMBER",
+    "36": "MIDDLENAME",
+    "37": "NEARBYGPSCOORDINATE",
+    "38": "O",
+    "39": "ORDINALDIRECTION",
+    "40": "PASSWORD",
+    "41": "PHONEIMEI",
+    "42": "PHONENUMBER",
+    "43": "PIN",
+    "44": "PREFIX",
+    "45": "SECONDARYADDRESS",
+    "46": "SEX",
+    "47": "SSN",
+    "48": "STATE",
+    "49": "STREET",
+    "50": "TIME",
+    "51": "URL",
+    "52": "USERAGENT",
+    "53": "USERNAME",
+    "54": "VEHICLEVIN",
+    "55": "VEHICLEVRM",
+    "56": "ZIPCODE"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "ACCOUNTNAME": 0,
+    "ACCOUNTNUMBER": 1,
+    "AGE": 2,
+    "AMOUNT": 3,
+    "BIC": 4,
+    "BITCOINADDRESS": 5,
+    "BUILDINGNUMBER": 6,
+    "CITY": 7,
+    "COMPANYNAME": 8,
+    "COUNTY": 9,
+    "CREDITCARDCVV": 10,
+    "CREDITCARDISSUER": 11,
+    "CREDITCARDNUMBER": 12,
+    "CURRENCY": 13,
+    "CURRENCYCODE": 14,
+    "CURRENCYNAME": 15,
+    "CURRENCYSYMBOL": 16,
+    "DATE": 17,
+    "DOB": 18,
+    "EMAIL": 19,
+    "ETHEREUMADDRESS": 20,
+    "EYECOLOR": 21,
+    "FIRSTNAME": 22,
+    "GENDER": 23,
+    "HEIGHT": 24,
+    "IBAN": 25,
+    "IP": 26,
+    "IPV4": 27,
+    "IPV6": 28,
+    "JOBAREA": 29,
+    "JOBTITLE": 30,
+    "JOBTYPE": 31,
+    "LASTNAME": 32,
+    "LITECOINADDRESS": 33,
+    "MAC": 34,
+    "MASKEDNUMBER": 35,
+    "MIDDLENAME": 36,
+    "NEARBYGPSCOORDINATE": 37,
+    "O": 38,
+    "ORDINALDIRECTION": 39,
+    "PASSWORD": 40,
+    "PHONEIMEI": 41,
+    "PHONENUMBER": 42,
+    "PIN": 43,
+    "PREFIX": 44,
+    "SECONDARYADDRESS": 45,
+    "SEX": 46,
+    "SSN": 47,
+    "STATE": 48,
+    "STREET": 49,
+    "TIME": 50,
+    "URL": 51,
+    "USERAGENT": 52,
+    "USERNAME": 53,
+    "VEHICLEVIN": 54,
+    "VEHICLEVRM": 55,
+    "ZIPCODE": 56
+  },
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.52.4",
+  "vocab_size": 30522
+}
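
The `id2label` and `label2id` maps above must be exact inverses for predictions to decode correctly. As a rough sketch of how that could be checked, the hypothetical helper `count_entity_types` below validates the two maps against each other and counts the labels other than the outside label `"O"`. It is run on a three-label excerpt of the config rather than the full file.

```python
import json


def count_entity_types(config, outside_label="O"):
    """Validate id2label/label2id and return the number of non-outside labels."""
    id2label = config["id2label"]
    label2id = config["label2id"]
    if len(id2label) != len(label2id):
        raise ValueError("id2label and label2id differ in size")
    # JSON object keys are strings, so "38" in id2label pairs with 38 in label2id.
    for idx, label in id2label.items():
        if label2id.get(label) != int(idx):
            raise ValueError(f"label map mismatch for {label!r}")
    return sum(1 for label in id2label.values() if label != outside_label)


# Excerpt of the config above (ids 37 to 39 only), for illustration.
excerpt = json.loads("""
{
  "id2label": {"37": "NEARBYGPSCOORDINATE", "38": "O", "39": "ORDINALDIRECTION"},
  "label2id": {"NEARBYGPSCOORDINATE": 37, "O": 38, "ORDINALDIRECTION": 39}
}
""")

print(count_entity_types(excerpt))  # prints: 2
```

The full config lists 57 labels (ids 0 through 56), one of which is `"O"`, which matches the 56 PII entity types stated in the README.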

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:49023596d0378adbb3ea9e52bf4085c7d0c3dcc594397ab72b65594d8086cbd1
+size 265639204
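
The three lines above are a Git LFS pointer, not the weights themselves: each line is a key, a space, and a value. A minimal sketch of reading such a pointer follows; the function name `parse_lfs_pointer` is hypothetical. Because `config.json` declares `torch_dtype` as `float32` (4 bytes per value), dividing the size by 4 gives a rough upper bound on the parameter count, since a safetensors file also carries a header.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:49023596d0378adbb3ea9e52bf4085c7d0c3dcc594397ab72b65594d8086cbd1
size 265639204
"""

pointer = parse_lfs_pointer(pointer_text)
# Upper bound only: the file also contains a safetensors header.
approx_params = pointer["size"] // 4
print(pointer["hash_algo"], pointer["size"], approx_params)
# prints: sha256 265639204 66409801
```

That is roughly 66 million float32 values.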

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "DistilBertTokenizer",
+  "unk_token": "[UNK]"
+}
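
`model_max_length` is 512 here, and the README notes that performance may vary on very long texts. One common way to handle longer inputs is to split the token ids into overlapping windows, each wrapped in `[CLS]` (id 101) and `[SEP]` (id 102) as listed in `added_tokens_decoder`. The sketch below illustrates that idea in plain Python. The function name `window_token_ids` and the default `overlap` of 64 are assumptions for illustration and are not part of this repository.

```python
def window_token_ids(token_ids, max_length=512, overlap=64, cls_id=101, sep_id=102):
    """Split token ids into overlapping windows of at most `max_length` ids."""
    body = max_length - 2  # reserve two slots for [CLS] and [SEP]
    if not 0 <= overlap < body:
        raise ValueError("overlap must be non-negative and smaller than the window body")
    step = body - overlap
    windows = []
    start = 0
    while True:
        chunk = token_ids[start:start + body]
        windows.append([cls_id] + chunk + [sep_id])
        if start + body >= len(token_ids):
            break  # this window reached the end of the input
        start += step
    return windows


# 1200 placeholder ids standing in for a long tokenized document.
fake_ids = list(range(1000, 2200))
windows = window_token_ids(fake_ids)
print([len(w) for w in windows])  # prints: [512, 512, 310]
```

Hugging Face tokenizers can produce similar windows directly through the `return_overflowing_tokens=True` and `stride` arguments. Predictions in the overlapping regions then have to be merged, which this sketch does not cover.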

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c7d1ec4cf7dcd7328946a00f36d8589b1df61b1b2703a5bcff0f4dec4158c02c
+size 5240

vocab.txt ADDED

The diff for this file is too large to render. See raw diff