# ZentryPII-278M: A LexGuard Model
ZentryPII-278M is a multilingual token classification model fine-tuned to identify and redact personally identifiable information (PII), such as names, locations, and time expressions, from noisy, ASR-style transcripts in English and Hindi. Built on top of XLM-RoBERTa-base, it is designed to serve as the redaction engine for LexGuard's privacy-preserving speech-to-text workflows.
## Model Details
- Model Name: ZentryPII-278M
- Architecture: XLM-RoBERTa-base
- Parameters: ~278M
- Task: Token Classification (NER-style)
- Labels: B-NAME, B-LOC, B-TIME, O
- Languages: English, Hindi
- Training Dataset: Synthetic ASR-style BIO-labeled dataset (~1,000 samples)
- Fine-tuning Epochs: 5
- Framework: Hugging Face Transformers
- Developer: LexGuard
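To make the label scheme concrete, here is a small hand-constructed sample in the style of the training data (illustrative only, not an actual training example; note that the label set contains only B- and O tags, so only single-token entities are shown):

```python
# Hypothetical BIO-labeled sample in the style of the training data.
# The label set has no I- tags, so single-token entities are used here.
tokens = ["i", "met", "rohit", "in", "delhi", "at", "noon"]
labels = ["O", "O", "B-NAME", "O", "B-LOC", "O", "B-TIME"]
```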
## Model Description
ZentryPII-278M is a multilingual token classification model developed by LexGuard to detect and redact personally identifiable information (PII) from noisy automatic speech recognition (ASR) outputs. It is fine-tuned on synthetic ASR-style transcripts that include disfluencies, Hindi-English code-switching, and real-world conversational patterns.
- Developed by: LexGuard
- Funded by: LexGuard
- Shared by: Sanskar Pandey
- Model type: Token Classification (NER)
- Language(s) (NLP): English, Hindi
- License: Apache 2.0
## Uses
### Direct Use
ZentryPII-278M is intended for direct use in redacting PII from ASR transcripts across multilingual, informal, or code-switched contexts. Users can apply it to:
- Transcribed audio from customer support calls
- Patient interviews and medical notes
- Legal and financial voice dictations
- Internal company meetings
It can be used via Hugging Face pipelines or as part of a preprocessing module in privacy-sensitive workflows.
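As a sketch of how pipeline output could drive redaction in such a preprocessing module (the `redact_pii` helper and the bracketed placeholder format are illustrative, not part of the model's API):

```python
from transformers import pipeline

# Load the model as a token-classification pipeline; "simple" aggregation
# merges subword pieces into whole entity spans with character offsets.
ner = pipeline(
    "token-classification",
    model="sanskxr02/zentrypii-278m",
    aggregation_strategy="simple",
)

def redact_pii(text: str) -> str:
    """Replace each detected PII span with a bracketed placeholder."""
    entities = ner(text)
    # Replace spans right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text

print(redact_pii("i met rohit near connaught place at three thirty"))
# e.g. "i met [NAME] near [LOC] at [TIME]"
```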
### Out-of-Scope Use
This model is not intended for:
- Document-level NER on structured or formal text (e.g. PDFs, contracts)
- Coreference resolution or full conversational entity linking
- Real-time inference in low-resource, on-device settings without optimization (see the quantization sketch after this list)
- Use in adversarial or surveillance applications that violate user privacy
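On the optimization point above: for CPU or edge deployment, dynamic int8 quantization is one common starting point. The sketch below uses PyTorch's `quantize_dynamic` on the model's Linear layers; its accuracy impact on this particular model has not been measured and should be validated before use:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("sanskxr02/zentrypii-278m")
tokenizer = AutoTokenizer.from_pretrained("sanskxr02/zentrypii-278m")

# Quantize the Linear layers to int8; activations are quantized dynamically
# at inference time, which typically shrinks the model and speeds up CPU runs.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("i met rohit near connaught place", return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits  # shape: [batch, seq_len, num_labels]
```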
## Bias, Risks, and Limitations
- Cultural and spelling bias: The model was trained on synthetic English-Hindi examples and may underperform on other dialects or spelling variations.
- Disfluency confusion: In very noisy ASR outputs, the model may struggle to distinguish PII from filler or background phrases.
- False positives/negatives: Tokens that double as common words or place names (e.g., "Rose", "Paris") may be missed or over-flagged.
- No anonymization guarantees: While helpful, the model does not provide cryptographic or legal guarantees for PII anonymization.
Always verify redacted output before deployment in sensitive or regulated environments.
### Recommendations
Users should be aware that ZentryPII-278M is optimized for ASR-style conversational input in English and Hindi. It should not be relied on as a sole mechanism for PII redaction in legally regulated environments. We recommend:
- Reviewing model output manually in high-risk domains such as healthcare or law
- Avoiding use in languages or dialects beyond those it was trained on
- Augmenting the model with rule-based fallback mechanisms for edge cases (a minimal sketch follows this list)
- Retraining or fine-tuning on domain-specific data when applying to new use cases
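As an example of the rule-based fallback mentioned above, here is a minimal sketch (the pattern and the `[TIME]` placeholder are illustrative): a model trained on spoken forms like "three thirty" may miss digit-formatted times, which a simple regex can catch.

```python
import re

# Hypothetical fallback rule: redact digit-style clock times ("3:30 pm")
# that a model trained on spoken-form transcripts might not flag.
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}\s*(?:am|pm)?\b", re.IGNORECASE)

def apply_time_fallback(text: str) -> str:
    """Redact digit-formatted times missed by the model."""
    return TIME_PATTERN.sub("[TIME]", text)

print(apply_time_fallback("meet me at 3:30 pm tomorrow"))
# -> "meet me at [TIME] tomorrow"
```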
## How to Get Started with the Model
Use the code snippet below to run the model with 🤗 Transformers:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("sanskxr02/zentrypii-278m")
tokenizer = AutoTokenizer.from_pretrained("sanskxr02/zentrypii-278m")

# "simple" aggregation merges subword tokens back into whole entity spans.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

output = ner("i met rohit near connaught place at three thirty")
for ent in output:
    print(f"{ent['word']} -> {ent['entity_group']}")
```
## Model Tree
- Base model: [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base)