

HerBERT-Guard for Polish: LLM Safety Classifier

Model Overview

HerBERT-Guard is a Polish-language safety classifier built on HerBERT, a BERT-based model pretrained on large-scale Polish corpora. It was fine-tuned to detect safety-relevant content in Polish text using a manually annotated dataset designed for evaluating the safety of large language models (LLMs), together with Polish translations of the PolyGuard and WildGuard datasets. The model classifies inputs into a taxonomy of safety categories inspired by Llama Guard.

More detailed information is available in the publication cited below.

Usage

You can use the model in a standard Hugging Face transformers pipeline for text classification:

from transformers import pipeline

model_name = "NASK-PIB/HerBERT-PL-Guard"

classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)

# Example Polish input ("How can I make a bomb at home?")
text = "Jak mogę zrobić bombę w domu?"

result = classifier(text)
print(result)
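The pipeline returns the standard transformers text-classification output: a list of dicts with `label` and `score`. A minimal sketch of flagging unsafe inputs from that output, assuming the label names match the taxonomy below (`safe` or `S1`–`S14`; this has not been verified against the checkpoint):

```python
def is_unsafe(pipeline_output, threshold=0.5):
    """Return True if the top prediction is an unsafe category.

    Assumes the standard pipeline output format: [{"label": str, "score": float}].
    The label names ("safe", "S1"..."S14") are assumptions based on the
    taxonomy in this model card.
    """
    top = pipeline_output[0]
    return top["label"] != "safe" and top["score"] >= threshold

# Hypothetical pipeline output for an unsafe prompt:
example = [{"label": "S9", "score": 0.97}]
print(is_unsafe(example))  # True
```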

Safety Categories

The model outputs one of 15 categories:

  • "safe" — content is not considered safety-relevant,
  • or one of the following 14 unsafe categories, based on the Llama Guard taxonomy:
  1. S1: Violent Crimes
  2. S2: Non-Violent Crimes
  3. S3: Sex-Related Crimes
  4. S4: Child Sexual Exploitation
  5. S5: Defamation
  6. S6: Specialized Advice
  7. S7: Privacy
  8. S8: Intellectual Property
  9. S9: Indiscriminate Weapons
  10. S10: Hate
  11. S11: Suicide & Self-Harm
  12. S12: Sexual Content
  13. S13: Elections
  14. S14: Code Interpreter Abuse
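For downstream logging or reporting it can help to map the short category codes to human-readable names. A sketch of that mapping, taken directly from the list above (the code-to-name pairs are the only assumed interface):

```python
# Mapping of unsafe category codes to names, as listed in the taxonomy above.
UNSAFE_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def describe(label: str) -> str:
    """Return a readable description for a predicted label."""
    if label == "safe":
        return "safe"
    return UNSAFE_CATEGORIES.get(label, f"unknown label: {label}")

print(describe("S11"))  # Suicide & Self-Harm
```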

License

The HerBERT-PL-Guard model is licensed under the CC BY-NC-SA 4.0 license.

The model was trained on the following datasets:

  • PL-Guard – the training portion of this dataset is internal and not publicly released
  • PolyGuardMix – licensed under CC BY 4.0
  • WildGuardMix – licensed under ODC-BY 1.0

The model is based on the pretrained allegro/herbert-base-cased, which is distributed under the CC BY 4.0 license.

Please ensure compliance with all dataset and model licenses when using or modifying this model.

📚 Citation

If you use this model or the associated dataset, please cite the following paper:

@inproceedings{plguard2025,
  author    = {Krasnodębska, Aleksandra and Seweryn, Karolina and Łukasik, Szymon and Kusa, Wojciech},
  title     = {{PL-Guard: Benchmarking Language Model Safety for Polish}},
  booktitle = {Proceedings of the 10th Workshop on Slavic Natural Language Processing},
  year      = {2025},
  address   = {Vienna, Austria},
  publisher = {Association for Computational Linguistics}
}