---
language: en
tags:
- pii-detection
- ner
- luganda
- differential-privacy
license: apache-2.0
datasets:
- Conrad747/lg-ner
---

# dp_pii_luganda_ner_model

## Model Description

This is a token classification model fine-tuned from `Conrad747/luganda-ner-v6` to detect Personally Identifiable Information (PII) such as names, emails, phone numbers, and dates of birth. It was trained with differential privacy (noise_multiplier=3.0, max_grad_norm=0.5, target_delta=1e-4), which limits how much any single training example can influence the learned weights and makes the model better suited to applications involving sensitive data.

## Intended Uses

- **Primary Use Case**: Identifying PII in text data, particularly Luganda and English text.
- **Supported Entities**: NAME, EMAIL, PHONE, DOB. The authoritative label set is the model's `id2label` mapping, so check it before relying on these names.
- **Applications**: Data anonymization, compliance with privacy regulations (e.g., GDPR), secure text processing.

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import json

# Load model and tokenizer
model_name = "e4gl33y3/dp_pii_luganda_ner_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)


def classify_pii(text, model, tokenizer, device="cuda" if torch.cuda.is_available() else "cpu", max_length=128):
    """Return the PII entities found in `text` (input is truncated to max_length tokens)."""
    model.to(device)
    model.eval()
    inputs = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt"
    ).to(device)

    # Use model's id2label for accurate label mapping
    label_map = model.config.id2label

    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    # Convert to plain Python ints so they match the integer keys of id2label
    predictions = torch.argmax(logits, dim=2)[0].cpu().tolist()

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    word_ids = inputs.word_ids()  # requires a fast tokenizer
    previous_word_idx = None
    pii_entities = []
    current_entity = {"type": None, "value": [], "start": None}

    def flush(entity):
        # Store the entity collected so far, if any
        if entity["type"] is not None:
            pii_entities.append({
                "type": entity["type"],
                "value": tokenizer.convert_tokens_to_string(entity["value"]).strip(),
                "start": entity["start"]  # token index, not a character offset
            })

    for idx, (token, pred, word_idx) in enumerate(zip(tokens, predictions, word_ids)):
        # Special and padding tokens have no word index
        if word_idx is None:
            continue
        label = label_map.get(pred, "O")
        if current_entity["type"] is not None and word_idx == previous_word_idx:
            # Later subword of a word that is already inside the entity
            current_entity["value"].append(token)
        elif label.startswith("B-"):
            flush(current_entity)
            current_entity = {"type": label[2:], "value": [token], "start": idx}
        elif label.startswith("I-") and current_entity["type"] == label[2:]:
            # Continue the entity across word boundaries (multi-word entities)
            current_entity["value"].append(token)
        else:
            flush(current_entity)
            current_entity = {"type": None, "value": [], "start": None}
        previous_word_idx = word_idx

    flush(current_entity)
    return {"text": text, "entities": pii_entities}


# Example usage
text = "Ssemakula yategese ekivvulu okutalaaga ebitundu omuli Buddu ne Bulemeezi."
result = classify_pii(text, model, tokenizer)
print(json.dumps(result, indent=2))
```

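For the data anonymization use case, the dictionary returned by `classify_pii` can drive a simple redaction pass. The sketch below is an illustration, not part of the model's API: `redact` is a hypothetical helper, and the sample result is hand-written rather than real model output. It replaces each detected value with a `[TYPE]` placeholder by plain string matching, which assumes the detokenized `value` appears verbatim in the input text.

```python
def redact(result):
    """Replace each detected entity value in the text with a [TYPE] placeholder."""
    text = result["text"]
    # Replace longer values first so a short value cannot clobber part of a longer one
    for entity in sorted(result["entities"], key=lambda e: len(e["value"]), reverse=True):
        if entity["value"]:
            text = text.replace(entity["value"], f"[{entity['type']}]")
    return text


# Hand-written example in the shape returned by classify_pii (not real model output)
sample = {
    "text": "Contact Ssemakula at ssemakula@example.com.",
    "entities": [
        {"type": "NAME", "value": "Ssemakula", "start": 2},
        {"type": "EMAIL", "value": "ssemakula@example.com", "start": 5},
    ],
}
print(redact(sample))
```

Because matching is by string, every occurrence of a detected value is replaced. Character offsets would be more precise, but `classify_pii` reports token indices only.
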
## Training Details

- **Dataset**: Trained on the `Conrad747/lg-ner` dataset.
- **Privacy**: Differential privacy applied with noise_multiplier=3.0, max_grad_norm=0.5, target_delta=1e-4.
- **Optimizer**: AdamW with learning rate 5e-5.
- **Epochs**: 5
- **Batch Size**: 8 (with `BatchMemoryManager` for memory efficiency).

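The privacy parameters above correspond to the two operations at the core of DP-SGD: each example's gradient is clipped to an L2 norm of at most `max_grad_norm`, and Gaussian noise with standard deviation `noise_multiplier * max_grad_norm` is added to the summed gradients before averaging. The following pure-Python sketch illustrates that step on toy gradients. The names `clip_gradient` and `dp_average_gradients` are hypothetical, and this is not the training code used for this model.

```python
import math
import random


def clip_gradient(grad, max_grad_norm):
    """Scale a per-example gradient so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / (norm + 1e-6))
    return [g * scale for g in grad]


def dp_average_gradients(per_example_grads, noise_multiplier, max_grad_norm, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_grad_norm) for g in per_example_grads]
    summed = [sum(col) for col in zip(*clipped)]
    sigma = noise_multiplier * max_grad_norm  # noise std on the summed gradient
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_example_grads) for n in noisy]


# Toy batch of 8 two-dimensional gradients, using the values from this card
rng = random.Random(0)
grads = [[rng.uniform(-2, 2), rng.uniform(-2, 2)] for _ in range(8)]
update = dp_average_gradients(grads, noise_multiplier=3.0, max_grad_norm=0.5, rng=rng)
print(update)
```

With a clipping norm of 0.5 and a multiplier of 3.0, the noise added to each coordinate of the summed gradient has standard deviation 1.5, while the sum of 8 clipped gradients has norm at most 4.
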
## Evaluation

- **Precision**: 0.9445
- **Recall**: 0.9438
- **F1 Score**: 0.9436

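The card does not state how these scores were computed (token-level or entity-level, and with which averaging). As one possible reading, the sketch below computes entity-level, micro-averaged precision, recall, and F1 from gold and predicted entity sets. The function `entity_prf` and the toy data are hypothetical and are not the evaluation script used here.

```python
def entity_prf(gold, predicted):
    """Micro-averaged precision, recall, and F1 over sets of (type, value) entities."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 3 gold entities, 4 predictions, 2 of them correct
gold = {("NAME", "Ssemakula"), ("PHONE", "0700 000000"), ("DOB", "1990-01-01")}
predicted = {("NAME", "Ssemakula"), ("PHONE", "0700 000000"),
             ("NAME", "Buddu"), ("EMAIL", "a@example.com")}
precision, recall, f1 = entity_prf(gold, predicted)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```
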
## Limitations

- Optimized for Luganda and English PII detection; performance may vary for other languages.
- The noise added for differential privacy can reduce accuracy, particularly for rare entities.
- The label mapping used at inference must match the labels the model was trained on; read it from `model.config.id2label` rather than hard-coding it.

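Since inference relies on `model.config.id2label`, it can help to inspect the mapping before use. The sketch below is a hypothetical helper (`entity_types` is not part of `transformers`) that extracts the entity types from a BIO-style mapping and reports any expected type that is missing. The example mapping is made up for illustration and may differ from the one shipped with this model.

```python
def entity_types(id2label):
    """Return the set of entity types found in a BIO-style id2label mapping."""
    types = set()
    for label in id2label.values():
        if label.startswith(("B-", "I-")):
            types.add(label[2:])
    return types


# Made-up mapping for illustration; in practice pass model.config.id2label
example_id2label = {0: "O", 1: "B-NAME", 2: "I-NAME", 3: "B-EMAIL", 4: "B-PHONE", 5: "I-PHONE"}
expected = {"NAME", "EMAIL", "PHONE", "DOB"}
missing = expected - entity_types(example_id2label)
print("Missing entity types:", sorted(missing))
```
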
## Contact

For issues or contributions, please visit the repository on Hugging Face or contact e4gl33y3.