dp_pii_luganda_ner_model

Model Description

This is a token classification model fine-tuned from Conrad747/luganda-ner-v6 to detect Personally Identifiable Information (PII) such as names, email addresses, phone numbers, and dates of birth. It was trained with differential privacy (noise_multiplier=3.0, max_grad_norm=0.5, target_delta=1e-4), which limits how much any single training example can influence the model and makes it better suited to applications involving sensitive data.

Intended Uses

  • Primary Use Case: Identifying PII in Luganda and English text.
  • Supported Entities: NAME, EMAIL, PHONE, DOB. The authoritative label set is model.config.id2label, since the labels depend on the training dataset.
  • Applications: Data anonymization, compliance with privacy regulations (e.g., GDPR), secure text processing.

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import json

# Load model and tokenizer
model_name = "e4gl33y3/dp_pii_luganda_ner_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Define classify_pii function
def classify_pii(text, model, tokenizer, device="cuda" if torch.cuda.is_available() else "cpu", max_length=128):
    model.to(device)
    model.eval()
    inputs = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt"
    ).to(device)
    
    # Use model's id2label for accurate label mapping
    label_map = model.config.id2label
    
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)[0].cpu().numpy()
    
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    word_ids = inputs.word_ids()
    previous_word_idx = None
    pii_entities = []
    current_entity = {"type": None, "value": [], "start": None}
    
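    for idx, (token, pred, word_idx) in enumerate(zip(tokens, predictions, word_ids)):
        label = label_map.get(int(pred), "O")
        # Special and padding tokens have no word index; skip them
        if word_idx is None:
            continue
        if word_idx == previous_word_idx and current_entity["type"] is not None:
            # Later subword of a word already inside an entity: keep it with that entity
            current_entity["value"].append(token)
        elif label.startswith("B-"):
            if current_entity["type"] is not None:
                pii_entities.append({
                    "type": current_entity["type"],
                    "value": tokenizer.convert_tokens_to_string(current_entity["value"]).strip(),
                    "start": current_entity["start"]
                })
            # "start" is the token index of the entity's first token, not a character offset
            current_entity = {"type": label[2:], "value": [token], "start": idx}
        elif label.startswith("I-") and current_entity["type"] == label[2:]:
            # An I- tag of the same type extends the entity, including across word boundaries
            current_entity["value"].append(token)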
        else:
            if current_entity["type"] is not None:
                pii_entities.append({
                    "type": current_entity["type"],
                    "value": tokenizer.convert_tokens_to_string(current_entity["value"]).strip(),
                    "start": current_entity["start"]
                })
            current_entity = {"type": None, "value": [], "start": None}
        previous_word_idx = word_idx
    
    if current_entity["type"] is not None:
        pii_entities.append({
            "type": current_entity["type"],
            "value": tokenizer.convert_tokens_to_string(current_entity["value"]).strip(),
            "start": current_entity["start"]
        })
    
    return {"text": text, "entities": pii_entities}

# Example usage
text = "Ssemakula yategese ekivvulu okutalaaga ebitundu omuli Buddu ne Bulemeezi."
result = classify_pii(text, model, tokenizer)
print(json.dumps(result, indent=2))
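
For the data anonymization use case, the entities returned by classify_pii can drive a simple redaction step. The following is a minimal sketch, not part of the model's code: the redact helper and the sample result are hypothetical, and it assumes each entity's "value" appears verbatim in the input text, which may not hold if the tokenizer normalizes whitespace or characters.

```python
def redact(text, entities, placeholder="[{type}]"):
    """Replace each detected entity value in `text` with a type placeholder."""
    # Handle longer values first so a short value never clobbers part of a longer one
    ordered = sorted(entities, key=lambda e: len(e["value"]), reverse=True)
    for entity in ordered:
        if entity["value"]:
            text = text.replace(entity["value"], placeholder.format(type=entity["type"]))
    return text

# Hypothetical result in the same shape classify_pii returns
result = {
    "text": "Contact Ssemakula at 0700123456.",
    "entities": [
        {"type": "NAME", "value": "Ssemakula", "start": 2},
        {"type": "PHONE", "value": "0700123456", "start": 5},
    ],
}
print(redact(result["text"], result["entities"]))
```

Plain string replacement redacts every occurrence of a value, which is usually what anonymization wants; character offsets would be needed to redact only the detected span.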

Training Details

  • Dataset: Trained on the Conrad747/lg-ner dataset.
  • Privacy: Differential privacy applied with noise_multiplier=3.0, max_grad_norm=0.5, target_delta=1e-4.
  • Optimizer: AdamW with learning rate 5e-5.
  • Epochs: 5
  • Batch Size: 8 (with BatchMemoryManager for memory efficiency).
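
To make the privacy parameters above concrete, here is a minimal pure-Python sketch of the gradient step that differentially private SGD performs: each example's gradient is clipped to max_grad_norm, the clipped gradients are summed, and Gaussian noise with standard deviation noise_multiplier × max_grad_norm is added before averaging. This is an illustration only, not the training code used for this model. The parameter names match those used by the Opacus library (as does BatchMemoryManager), but the card does not state which library was used, and the function name dp_sgd_gradient is hypothetical.

```python
import math
import random

def dp_sgd_gradient(per_example_grads, max_grad_norm=0.5, noise_multiplier=3.0, rng=None):
    """Return the noisy average gradient for one batch (illustrative DP-SGD step)."""
    rng = rng or random.Random()
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down so its L2 norm is at most max_grad_norm
        scale = min(1.0, max_grad_norm / (norm + 1e-6))
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Gaussian noise with standard deviation noise_multiplier * max_grad_norm
    sigma = noise_multiplier * max_grad_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_example_grads) for n in noisy]

grads = [[3.0, 4.0], [0.3, 0.0]]  # made-up per-example gradients
print(dp_sgd_gradient(grads, rng=random.Random(0)))
```

With a batch size of 8 and a noise multiplier of 3.0, the noise is large relative to any single clipped gradient, which is the source of the accuracy trade-off noted under Limitations.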

Evaluation

  • Precision: 0.9445
  • Recall: 0.9438
  • F1 Score: 0.9436
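
The card does not say how these scores were averaged or on which split they were measured. One observation: the reported F1 (0.9436) is slightly below both precision and recall, which cannot happen for a single harmonic mean of the two, so the figures are consistent with a per-class average such as a support-weighted one. The sketch below shows that kind of averaging on token labels; whether the author scored tokens or whole entities is an assumption, and weighted_prf is a hypothetical helper.

```python
from collections import Counter

def weighted_prf(gold, pred):
    """Support-weighted precision, recall and F1 over token labels."""
    support = Counter(gold)
    total = len(gold)
    precision = recall = f1 = 0.0
    for label in sorted(support):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        p_l = tp / (tp + fp) if tp + fp else 0.0
        r_l = tp / (tp + fn) if tp + fn else 0.0
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        # Weight each class by its share of the gold labels
        weight = support[label] / total
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return precision, recall, f1

# Made-up labels: one NAME token is missed and predicted as "O"
gold = ["NAME", "NAME", "O", "O"]
pred = ["NAME", "O", "O", "O"]
print(weighted_prf(gold, pred))
```

In this toy example the weighted F1 also lands below both the weighted precision and the weighted recall, the same pattern as the reported numbers.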

Limitations

  • Optimized for Luganda and English PII detection; performance may vary for other languages.
  • Differentially private training adds noise to the gradients, which can reduce accuracy, particularly for rare entities.
  • Inference depends on the label mapping stored in the model config (id2label) matching the labels used during training.
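
Because the label mapping matters, it is worth checking which entity types a checkpoint actually emits before relying on the list under Intended Uses. The helper below is a small sketch that extracts the types from a BIO-style mapping; entity_types and the example mapping are hypothetical, and with the real model you would pass model.config.id2label instead.

```python
def entity_types(id2label):
    """List the entity types encoded in a BIO-style id2label mapping."""
    types = set()
    for label in id2label.values():
        # "B-NAME" and "I-NAME" both contribute the type "NAME"; "O" is ignored
        if label.startswith(("B-", "I-")):
            types.add(label[2:])
    return sorted(types)

# Hypothetical mapping for illustration only
example_id2label = {0: "O", 1: "B-NAME", 2: "I-NAME", 3: "B-EMAIL", 4: "I-EMAIL"}
print(entity_types(example_id2label))
```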

Contact

For issues or contributions, please visit the repository on Hugging Face or contact e4gl33y3.
