---
library_name: transformers
tags:
- pii
- ner
- asr
- redaction
- privacy
- lexguard
- xlm-roberta
- multilingual
- huggingface
- token-classification
license: apache-2.0
language:
- en
- hi
metrics:
- seqeval
base_model:
- FacebookAI/xlm-roberta-base
---

[![Hugging Face](https://img.shields.io/badge/HuggingFace-ZentryPII-yellow)](https://huggingface.co/sanskxr02/zentrypii-278m)

# ZentryPII-278M – A LexGuard Model

ZentryPII-278M is a multilingual token classification model fine-tuned to identify and redact personally identifiable information (PII) — such as names, locations, and time expressions — from noisy, ASR-style transcripts in English and Hindi. Built on top of XLM-RoBERTa-base, it is designed to serve as the redaction engine for LexGuard’s privacy-preserving speech-to-text workflows.

## Model Details

- **Model Name:** ZentryPII-278M
- **Architecture:** XLM-RoBERTa-base
- **Parameters:** ~278M
- **Task:** Token Classification (NER-style)
- **Labels:** B-NAME, B-LOC, B-TIME, O
- **Languages:** English, Hindi
- **Training Dataset:** Synthetic ASR-style BIO-labeled dataset (~1,000 samples)
- **Fine-tuning Epochs:** 5
- **Framework:** Hugging Face Transformers
- **Developer:** LexGuard

### Model Description

ZentryPII-278M is a multilingual token classification model developed by LexGuard to detect and redact personally identifiable information (PII) from noisy automatic speech recognition (ASR) outputs. It is fine-tuned on synthetic ASR-style transcripts that include disfluencies, Hindi-English code-switching, and real-world conversational patterns.

- **Developed by:** LexGuard
- **Funded by:** LexGuard
- **Shared by:** Sanskar Pandey
- **Model type:** Token Classification (NER)
- **Language(s) (NLP):** English, Hindi
- **License:** Apache 2.0

## Uses

### Direct Use

ZentryPII-278M is intended for direct use in redacting PII from ASR transcripts across multilingual, informal, or code-switched contexts. Users can apply it to:

- Transcribed audio from customer support calls
- Patient interviews and medical notes
- Legal and financial voice dictations
- Internal company meetings

It can be used via Hugging Face pipelines or as part of a preprocessing module in privacy-sensitive workflows.

### Out-of-Scope Use

This model is not intended for:

- Document-level NER on structured or formal text (e.g. PDFs, contracts)
- Coreference resolution or full conversational entity linking
- Real-time inference in low-resource, on-device settings without optimization
- Use in adversarial or surveillance applications that violate user privacy

---

## Bias, Risks, and Limitations

- **Cultural and spelling bias:** The model was trained on synthetic English-Hindi examples and may underperform on other dialects or spelling variations.
- **Disfluency confusion:** In very noisy ASR outputs, the model may struggle to distinguish PII from filler or background phrases.
- **False positives/negatives:** Names that are also common nouns (e.g. "Rose", "Paris") may be missed or over-flagged.
- **No anonymization guarantees:** While helpful, the model does not provide cryptographic or legal guarantees for PII anonymization.

Always verify redacted output before deployment in sensitive or regulated environments.

### Recommendations

Users should be aware that ZentryPII-278M is optimized for ASR-style conversational input in English and Hindi. It should not be relied on as a sole mechanism for PII redaction in legally regulated environments.

We recommend:

- Reviewing model output manually in high-risk domains such as healthcare or law
- Avoiding use in languages or dialects beyond those it was trained on
- Augmenting the model with rule-based fallback mechanisms for edge cases (a sketch follows this list)
- Retraining or fine-tuning on domain-specific data when applying to new use cases
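As a minimal illustration of the rule-based fallback idea, the snippet below masks pattern-friendly PII (emails, phone numbers) that falls outside the model's B-NAME/B-LOC/B-TIME tag set. The `FALLBACK_PATTERNS` table and the `apply_fallback` helper are hypothetical names for this sketch, not part of this repository; tune the patterns to your own data before relying on them.

```python
import re

# Hypothetical regex fallbacks for PII types the model does not label
# (the B-NAME/B-LOC/B-TIME tag set covers none of these).
FALLBACK_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}

def apply_fallback(text: str) -> str:
    """Mask anything the regex rules catch, independently of the model."""
    for label, pattern in FALLBACK_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

# Run the regex pass before or after the model's redaction pass.
print(apply_fallback("call me at +91 98765 43210 or mail [email protected]"))
# -> "call me at [PHONE] or mail [EMAIL]"
```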
---

## How to Get Started with the Model

Use the code snippet below to run the model using 🤗 Transformers:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model = AutoModelForTokenClassification.from_pretrained("sanskxr02/zentrypii-278m")
tokenizer = AutoTokenizer.from_pretrained("sanskxr02/zentrypii-278m")

ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

output = ner("i met rohit near connaught place at three thirty")
for ent in output:
    print(f"{ent['word']} → {ent['entity_group']}")
```
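To turn the tagged entities into a redacted string, the character offsets returned by the pipeline can be masked from right to left so that earlier offsets stay valid. The `redact` helper below is a sketch that builds on the snippet above, not an API of this repository; with `aggregation_strategy="simple"`, `entity_group` holds the label without its `B-` prefix.

```python
def redact(text, entities):
    """Replace each detected span with its label, working right to left
    so that the start/end offsets of earlier spans remain valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

text = "i met rohit near connaught place at three thirty"
print(redact(text, ner(text)))
# e.g. "i met [NAME] near [LOC] at [TIME]" (exact spans depend on model output)
```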