# ZentryPII-278M: A LexGuard Model
ZentryPII-278M is a multilingual token classification model fine-tuned to identify and redact personally identifiable information (PII), such as names, locations, and time expressions, from noisy, ASR-style transcripts in English and Hindi. Built on top of XLM-RoBERTa-base, it is designed to serve as the redaction engine for LexGuard's privacy-preserving speech-to-text workflows.
## Model Details
- Model Name: ZentryPII-278M
- Architecture: XLM-RoBERTa-base
- Parameters: ~278M
- Task: Token Classification (NER-style)
- Labels: B-NAME, B-LOC, B-TIME, O
- Languages: English, Hindi
- Training Dataset: Synthetic ASR-style BIO-labeled dataset (~1,000 samples)
- Fine-tuning Epochs: 5
- Framework: Hugging Face Transformers
- Developer: LexGuard
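To make the label scheme concrete, here is a small hand-constructed sample in the style of the training data (illustrative only, not an actual training example; note that the label set contains only B- and O tags, so only single-token entities are shown):

```python
# Hypothetical BIO-labeled sample in the style of the training data.
# The label set has no I- tags, so single-token entities are used here.
tokens = ["i", "met", "rohit", "in", "delhi", "at", "noon"]
labels = ["O", "O", "B-NAME", "O", "B-LOC", "O", "B-TIME"]
```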
## Model Description
ZentryPII-278M is a multilingual token classification model developed by LexGuard to detect and redact personally identifiable information (PII) from noisy automatic speech recognition (ASR) outputs. It is fine-tuned on synthetic ASR-style transcripts that include disfluencies, Hindi-English code-switching, and real-world conversational patterns.
- Developed by: LexGuard
- Funded by: LexGuard
- Shared by: Sanskar Pandey
- Model type: Token Classification (NER)
- Language(s) (NLP): English, Hindi
- License: Apache 2.0
## Uses
### Direct Use
ZentryPII-278M is intended for direct use in redacting PII from ASR transcripts across multilingual, informal, or code-switched contexts. Users can apply it to:
- Transcribed audio from customer support calls
- Patient interviews and medical notes
- Legal and financial voice dictations
- Internal company meetings
It can be used via Hugging Face pipelines or as part of a preprocessing module in privacy-sensitive workflows.
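As a sketch of how pipeline output could drive redaction in such a preprocessing module (the `redact_pii` helper and the bracketed placeholder format are illustrative, not part of the model's API):

```python
from transformers import pipeline

# Load the model as a token-classification pipeline; "simple" aggregation
# merges subword pieces into whole entity spans with character offsets.
ner = pipeline(
    "token-classification",
    model="sanskxr02/zentrypii-278m",
    aggregation_strategy="simple",
)

def redact_pii(text: str) -> str:
    """Replace each detected PII span with a bracketed placeholder."""
    entities = ner(text)
    # Replace spans right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text

print(redact_pii("i met rohit near connaught place at three thirty"))
# e.g. "i met [NAME] near [LOC] at [TIME]"
```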
### Out-of-Scope Use
This model is not intended for:
- Document-level NER on structured or formal text (e.g. PDFs, contracts)
- Coreference resolution or full conversational entity linking
- Real-time inference in low-resource, on-device settings without optimization (see the quantization sketch after this list)
- Use in adversarial or surveillance applications that violate user privacy
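On the optimization point above: for CPU or edge deployment, dynamic int8 quantization is one common starting point. The sketch below uses PyTorch's `quantize_dynamic` on the model's Linear layers; its accuracy impact on this particular model has not been measured and should be validated before use:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("sanskxr02/zentrypii-278m")
tokenizer = AutoTokenizer.from_pretrained("sanskxr02/zentrypii-278m")

# Quantize the Linear layers to int8; activations are quantized dynamically
# at inference time, which typically shrinks the model and speeds up CPU runs.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("i met rohit near connaught place", return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits  # shape: [batch, seq_len, num_labels]
```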
## Bias, Risks, and Limitations
- Cultural and spelling bias: The model was trained on synthetic English-Hindi examples and may underperform on other dialects or spelling variations.
- Disfluency confusion: In very noisy ASR outputs, the model may struggle to distinguish PII from filler or background phrases.
- False positives/negatives: Tokens that double as common words or place names (e.g., "Rose", "Paris") may be missed or over-flagged.
- No anonymization guarantees: While helpful, the model does not provide cryptographic or legal guarantees for PII anonymization.
Always verify redacted output before deployment in sensitive or regulated environments.
### Recommendations
Users should be aware that ZentryPII-278M is optimized for ASR-style conversational input in English and Hindi. It should not be relied on as a sole mechanism for PII redaction in legally regulated environments. We recommend:
- Reviewing model output manually in high-risk domains such as healthcare or law
- Avoiding use in languages or dialects beyond those it was trained on
- Augmenting the model with rule-based fallback mechanisms for edge cases (a minimal sketch follows this list)
- Retraining or fine-tuning on domain-specific data when applying to new use cases
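As an example of the rule-based fallback mentioned above, here is a minimal sketch (the pattern and the `[TIME]` placeholder are illustrative): a model trained on spoken forms like "three thirty" may miss digit-formatted times, which a simple regex can catch.

```python
import re

# Hypothetical fallback rule: redact digit-style clock times ("3:30 pm")
# that a model trained on spoken-form transcripts might not flag.
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}\s*(?:am|pm)?\b", re.IGNORECASE)

def apply_time_fallback(text: str) -> str:
    """Redact digit-formatted times missed by the model."""
    return TIME_PATTERN.sub("[TIME]", text)

print(apply_time_fallback("meet me at 3:30 pm tomorrow"))
# -> "meet me at [TIME] tomorrow"
```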
## How to Get Started with the Model
Use the code snippet below to run the model with 🤗 Transformers:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("sanskxr02/zentrypii-278m")
tokenizer = AutoTokenizer.from_pretrained("sanskxr02/zentrypii-278m")

# "simple" aggregation merges subword tokens back into whole entity spans.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

output = ner("i met rohit near connaught place at three thirty")
for ent in output:
    print(f"{ent['word']} -> {ent['entity_group']}")
```
## Model Tree
- Base model: [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base)