---
language:
  - lg
  - en
tags:
  - pii-detection
  - ner
  - luganda
  - differential-privacy
license: apache-2.0
datasets:
  - Conrad747/lg-ner
---

# dp_pii_luganda_ner_model

## Model Description
This is a token classification model fine-tuned from `Conrad747/luganda-ner-v6` to detect Personally Identifiable Information (PII) such as names, email addresses, phone numbers, and dates of birth. It was trained with differential privacy (noise_multiplier=3.0, max_grad_norm=0.5, target_delta=1e-4), which limits how much the model can reveal about any single training example. The resulting privacy budget (epsilon) is not reported in this card, so confirm that the guarantee meets your requirements before relying on it for sensitive data.

## Intended Uses
- **Primary Use Case**: Identifying PII in text data, particularly for Luganda and English texts.
- **Supported Entities**: NAME, EMAIL, PHONE, DOB. The exact label names are defined by `model.config.id2label` and may differ from this list.
- **Applications**: Data anonymization, compliance with privacy regulations (e.g., GDPR), secure text processing.

## How to Use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import json

# Load model and tokenizer
model_name = "e4gl33y3/dp_pii_luganda_ner_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Define classify_pii function
def classify_pii(text, model, tokenizer, device="cuda" if torch.cuda.is_available() else "cpu", max_length=128):
    model.to(device)
    model.eval()
    inputs = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt"
    ).to(device)
    
    # Use model's id2label for accurate label mapping
    label_map = model.config.id2label
    
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)[0].cpu().numpy()
    
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    word_ids = inputs.word_ids()
    previous_word_idx = None
    pii_entities = []
    current_entity = {"type": None, "value": [], "start": None}
    
    for idx, (token, pred, word_idx) in enumerate(zip(tokens, predictions, word_ids)):
        # Special and padding tokens have no word index; skip them
        if word_idx is None:
            continue
        label = label_map.get(int(pred), "O")
        if word_idx == previous_word_idx:
            # Later subword of the same word: keep it with the entity that word started
            if current_entity["type"] is not None:
                current_entity["value"].append(token)
        elif label.startswith("I-") and current_entity["type"] == label[2:]:
            # New word tagged I- with a matching type extends a multi-word entity
            current_entity["value"].append(token)
        else:
            # Close any open entity before handling the new word
            if current_entity["type"] is not None:
                pii_entities.append({
                    "type": current_entity["type"],
                    "value": tokenizer.convert_tokens_to_string(current_entity["value"]).strip(),
                    "start": current_entity["start"]
                })
            if label.startswith("B-"):
                # "start" is a token index in the encoded sequence, not a character offset
                current_entity = {"type": label[2:], "value": [token], "start": idx}
            else:
                current_entity = {"type": None, "value": [], "start": None}
        previous_word_idx = word_idx
    
    if current_entity["type"] is not None:
        pii_entities.append({
            "type": current_entity["type"],
            "value": tokenizer.convert_tokens_to_string(current_entity["value"]).strip(),
            "start": current_entity["start"]
        })
    
    return {"text": text, "entities": pii_entities}

# Example usage
text = "Ssemakula yategese ekivvulu okutalaaga ebitundu omuli Buddu ne Bulemeezi."
result = classify_pii(text, model, tokenizer)
print(json.dumps(result, indent=2))
```
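
For the anonymization use case, the dictionary returned by `classify_pii` can be post-processed into redacted text. The following is a minimal sketch, not part of the released model: `redact_pii` and its `placeholder` parameter are hypothetical names, and the example input is hand-written in the shape `classify_pii` returns. It assumes each entity `value` appears verbatim in the original text, which may not hold when the tokenizer normalizes characters or whitespace.

```python
def redact_pii(result, placeholder="[{type}]"):
    """Replace each detected entity value in the text with a type placeholder."""
    redacted = result["text"]
    # Replace longer values first so a short value cannot split a longer one
    entities = sorted(result["entities"], key=lambda e: len(e["value"]), reverse=True)
    for entity in entities:
        value = entity["value"]
        # Skip values that cannot be found verbatim instead of guessing
        if value and value in redacted:
            redacted = redacted.replace(value, placeholder.format(type=entity["type"]))
    return redacted


# Hypothetical output in the shape returned by classify_pii
example = {
    "text": "Contact Ssemakula at 0700123456.",
    "entities": [
        {"type": "NAME", "value": "Ssemakula", "start": 2},
        {"type": "PHONE", "value": "0700123456", "start": 5},
    ],
}
print(redact_pii(example))  # Contact [NAME] at [PHONE].
```

String replacement redacts every occurrence of a detected value, which is usually what anonymization wants. Entities that the model misses are left untouched, so the output should still be reviewed before release.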

## Training Details
- **Dataset**: Trained on the `Conrad747/lg-ner` dataset.
- **Privacy**: Differential privacy applied with noise_multiplier=3.0, max_grad_norm=0.5, target_delta=1e-4.
- **Optimizer**: AdamW with learning rate 5e-5.
- **Epochs**: 5
- **Batch Size**: 8 (with BatchMemoryManager for memory efficiency).
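
To make the privacy parameters above concrete, here is a minimal pure-Python sketch of the standard DP-SGD gradient step: each per-sample gradient is clipped to `max_grad_norm`, the clipped gradients are summed, Gaussian noise with standard deviation `noise_multiplier * max_grad_norm` is added, and the result is averaged. This card does not include the training script, so this is an illustration of the general mechanism rather than the code used for this model; `dp_sgd_aggregate` is a hypothetical helper, and a real run would rely on a DP library (the mention of `BatchMemoryManager` suggests Opacus, but that is an assumption).

```python
import math
import random


def dp_sgd_aggregate(per_sample_grads, max_grad_norm=0.5, noise_multiplier=3.0, rng=None):
    """Clip each per-sample gradient, sum, add Gaussian noise, and average."""
    rng = rng or random.Random()
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose L2 norm exceeds the clipping bound
        scale = min(1.0, max_grad_norm / (norm + 1e-6))
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Noise standard deviation is noise_multiplier * max_grad_norm
    sigma = noise_multiplier * max_grad_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_sample_grads) for n in noisy]


# Toy gradients for a batch of two samples (illustrative values only)
grads = [[3.0, 4.0], [0.0, 0.1]]
print(dp_sgd_aggregate(grads, rng=random.Random(0)))
```

With `noise_multiplier=3.0` and `max_grad_norm=0.5`, the noise standard deviation is 1.5 per coordinate before averaging, which is large relative to the clipped gradients in a batch of 8. This is the source of the accuracy trade-off for rare entity types.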

## Evaluation
- **Precision**: 0.9445
- **Recall**: 0.9438
- **F1 Score**: 0.9436

## Limitations
- Optimized for Luganda and English PII detection; performance may vary for other languages.
- Differentially private training adds noise to the gradient updates, which can reduce accuracy, especially for rare entity types.
- Label mapping must match dataset labels for accurate inference.
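
One way to check the label mapping before inference is to compare the entity types you expect against the labels in `model.config.id2label`. The sketch below is a hypothetical helper (`missing_entity_types` is not part of the model or of `transformers`), and the example mapping is made up for illustration; pass the real `model.config.id2label` in practice.

```python
def missing_entity_types(id2label, expected_types):
    """Return the expected entity types that have no B-/I- label in id2label."""
    found = set()
    for label in id2label.values():
        # Labels follow the BIO scheme, e.g. "B-NAME" or "I-NAME"
        if label != "O" and "-" in label:
            found.add(label.split("-", 1)[1])
    return sorted(set(expected_types) - found)


# Hypothetical mapping for illustration only
example_id2label = {0: "O", 1: "B-NAME", 2: "I-NAME", 3: "B-EMAIL"}
print(missing_entity_types(example_id2label, ["NAME", "EMAIL", "PHONE", "DOB"]))
# ['DOB', 'PHONE']
```

An empty list means every expected type has at least one matching label. A non-empty list means downstream code is waiting for entity types the model can never emit.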

## Contact
For issues or contributions, please visit the repository on Hugging Face or contact e4gl33y3.