---
library_name: transformers
tags:
- pii
- ner
- asr
- redaction
- privacy
- lexguard
- xlm-roberta
- multilingual
- huggingface
- token-classification
license: apache-2.0
language:
- en
- hi
metrics:
- seqeval
base_model:
- FacebookAI/xlm-roberta-base
---

[![Hugging Face](https://img.shields.io/badge/HuggingFace-ZentryPII-yellow)](https://huggingface.co/sanskxr02/zentrypii-278m)

# ZentryPII-278M – A LexGuard Model

ZentryPII-278M is a multilingual token classification model fine-tuned to identify and redact personally identifiable information (PII) — such as names, locations, and time expressions — from noisy, ASR-style transcripts in English and Hindi. Built on top of XLM-RoBERTa-base, it is designed to serve as the redaction engine for LexGuard’s privacy-preserving speech-to-text workflows.

## Model Details

- **Model Name:** ZentryPII-278M
- **Architecture:** XLM-RoBERTa-base
- **Parameters:** ~278M
- **Task:** Token Classification (NER-style)
- **Labels:** B-NAME, B-LOC, B-TIME, O
- **Languages:** English, Hindi
- **Training Dataset:** Synthetic ASR-style BIO-labeled dataset (~1,000 samples)
- **Fine-tuning Epochs:** 5
- **Framework:** Hugging Face Transformers
- **Developer:** LexGuard

### Model Description

ZentryPII-278M is a multilingual token classification model developed by LexGuard to detect and redact personally identifiable information (PII) from noisy automatic speech recognition (ASR) outputs. It is fine-tuned on synthetic ASR-style transcripts that include disfluencies, Hindi-English code-switching, and real-world conversational patterns.

- **Developed by:** LexGuard
- **Funded by:** LexGuard
- **Shared by:** Sanskar Pandey
- **Model type:** Token Classification (NER)
- **Language(s) (NLP):** English, Hindi
- **License:** Apache 2.0

## Uses

### Direct Use

ZentryPII-278M is intended for direct use in redacting PII from ASR transcripts across multilingual, informal, or code-switched contexts. Users can apply it to:

- Transcribed audio from customer support calls
- Patient interviews and medical notes
- Legal and financial voice dictations
- Internal company meetings

It can be used via Hugging Face pipelines or as part of a preprocessing module in privacy-sensitive workflows.

### Out-of-Scope Use

This model is not intended for:

- Document-level NER on structured or formal text (e.g. PDFs, contracts)
- Coreference resolution or full conversational entity linking
- Real-time inference in low-resource, on-device settings without optimization
- Use in adversarial or surveillance applications that violate user privacy

---

## Bias, Risks, and Limitations

- **Cultural and spelling bias:** The model was trained on synthetic English-Hindi examples and may underperform on other dialects or spelling variations.
- **Disfluency confusion:** In very noisy ASR outputs, the model may struggle to distinguish PII from filler or background phrases.
- **False positives/negatives:** Names that are also common nouns (e.g. "Rose", "Paris") may be missed or over-flagged.
- **No anonymization guarantees:** While helpful, the model does not provide cryptographic or legal guarantees for PII anonymization.

Always verify redacted output before deployment in sensitive or regulated environments.

### Recommendations

Users should be aware that ZentryPII-278M is optimized for ASR-style conversational input in English and Hindi. It should not be relied on as a sole mechanism for PII redaction in legally regulated environments.

We recommend:

- Reviewing model output manually in high-risk domains such as healthcare or law
- Avoiding use in languages or dialects beyond those it was trained on
- Augmenting the model with rule-based fallback mechanisms for edge cases (a sketch follows this list)
- Retraining or fine-tuning on domain-specific data when applying to new use cases
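As a minimal illustration of the rule-based fallback idea, the snippet below masks pattern-friendly PII (emails, phone numbers) that falls outside the model's B-NAME/B-LOC/B-TIME tag set. The `FALLBACK_PATTERNS` table and the `apply_fallback` helper are hypothetical names for this sketch, not part of this repository; tune the patterns to your own data before relying on them.

```python
import re

# Hypothetical regex fallbacks for PII types the model does not label
# (the B-NAME/B-LOC/B-TIME tag set covers none of these).
FALLBACK_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}

def apply_fallback(text: str) -> str:
    """Mask anything the regex rules catch, independently of the model."""
    for label, pattern in FALLBACK_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

# Run the regex pass before or after the model's redaction pass.
print(apply_fallback("call me at +91 98765 43210 or mail [email protected]"))
# -> "call me at [PHONE] or mail [EMAIL]"
```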
---

## How to Get Started with the Model

Use the code snippet below to run the model using 🤗 Transformers:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("sanskxr02/zentrypii-278m")
tokenizer = AutoTokenizer.from_pretrained("sanskxr02/zentrypii-278m")

ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

output = ner("i met rohit near connaught place at three thirty")
for ent in output:
    print(f"{ent['word']} → {ent['entity_group']}")
```
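To turn the tagged entities into a redacted string, the character offsets returned by the pipeline can be masked from right to left so that earlier offsets stay valid. The `redact` helper below is a sketch that builds on the snippet above, not an API of this repository; with `aggregation_strategy="simple"`, `entity_group` holds the label without its `B-` prefix.

```python
def redact(text, entities):
    """Replace each detected span with its label, working right to left
    so that the start/end offsets of earlier spans remain valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

text = "i met rohit near connaught place at three thirty"
print(redact(text, ner(text)))
# e.g. "i met [NAME] near [LOC] at [TIME]" (exact spans depend on model output)
```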