HerBERT-Guard for Polish: LLM Safety Classifier
Model Overview
HerBERT-Guard is a Polish-language safety classifier built on HerBERT, a BERT-based model pretrained on large-scale Polish corpora. It has been fine-tuned to detect safety-relevant content in Polish text using two sources: a manually annotated dataset designed for evaluating the safety of large language models (LLMs), and Polish translations of the PolyGuard and WildGuard datasets. The model classifies text into a taxonomy of safety categories inspired by Llama Guard.
More detailed information is available in the publication cited below.
Usage
You can use the model in a standard Hugging Face transformers pipeline for text classification:
from transformers import pipeline
model_name = "NASK-PIB/HerBERT-PL-Guard"
classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)
# Example Polish input
text = "Jak mogę zrobić bombę w domu?"
result = classifier(text)
print(result)
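A text-classification pipeline returns a list of dictionaries with `label` and `score` keys. As a minimal sketch of how that output might be turned into a moderation decision, the helper below flags a text when the top-scoring label is not "safe" and its score clears a threshold. The function name `is_flagged`, the 0.5 threshold, and the mocked prediction are illustrative assumptions, not part of the model card. The exact label strings the model emits should be checked against its configuration (`id2label`).

```python
from typing import Dict, List


def is_flagged(
    predictions: List[Dict],
    safe_label: str = "safe",   # assumed label string; verify in the model config
    threshold: float = 0.5,     # assumed cut-off; tune on your own data
) -> bool:
    """Return True when the top-scoring label is unsafe and confident enough."""
    top = max(predictions, key=lambda p: p["score"])
    return top["label"] != safe_label and top["score"] >= threshold


# Mocked prediction in the shape a text-classification pipeline returns.
# The label and score here are made up for illustration.
mock_result = [{"label": "S9", "score": 0.97}]
print(is_flagged(mock_result))
```

In practice `mock_result` would be replaced by the `result` returned by `classifier(text)`. Treating low-confidence unsafe predictions as not flagged is a policy choice. A stricter deployment might flag any non-safe top label regardless of score.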
Safety Categories
The model outputs one of 15 categories: "safe" (content is not considered safety-relevant) or one of the following 14 unsafe categories, based on the Llama Guard taxonomy:
- S1: Violent Crimes
- S2: Non-Violent Crimes
- S3: Sex-Related Crimes
- S4: Child Sexual Exploitation
- S5: Defamation
- S6: Specialized Advice
- S7: Privacy
- S8: Intellectual Property
- S9: Indiscriminate Weapons
- S10: Hate
- S11: Suicide & Self-Harm
- S12: Sexual Content
- S13: Elections
- S14: Code Interpreter Abuse
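For logging or display, the category codes listed above can be mapped to their names. The following is a small sketch. The dictionary `UNSAFE_CATEGORIES` and the helper `describe_label` are hypothetical names, and the code assumes the model emits the short codes ("S1" to "S14") and "safe" as label strings, which should be confirmed against the model's configuration.

```python
# Category codes and names copied from the list in this model card.
UNSAFE_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}


def describe_label(label: str) -> str:
    """Turn a label such as 'S9' into a readable 'S9: Indiscriminate Weapons'."""
    if label == "safe":
        return "safe"
    name = UNSAFE_CATEGORIES.get(label)
    # Unknown labels are reported rather than silently dropped.
    return f"{label}: {name}" if name else f"unknown label: {label}"


print(describe_label("S9"))
```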
License
The HerBERT-PL-Guard model is licensed under the CC BY-NC-SA 4.0 license.
The model was trained on the following datasets:
- PL-Guard – the training portion of this dataset is internal and not publicly released
- PolyGuardMix – licensed under CC BY 4.0
- WildGuardMix – licensed under ODC-BY 1.0
The model is based on the pretrained allegro/herbert-base-cased, which is distributed under the CC BY 4.0 license.
Please ensure compliance with all dataset and model licenses when using or modifying this model.
📚 Citation
If you use this model or the associated dataset, please cite the following paper:
@inproceedings{plguard2025,
author = {Krasnodębska, Aleksandra and Seweryn, Karolina and Łukasik, Szymon and Kusa, Wojciech},
title = {{PL-Guard: Benchmarking Language Model Safety for Polish}},
booktitle = {Proceedings of the 10th Workshop on Slavic Natural Language Processing},
year = {2025},
address = {Vienna, Austria},
publisher = {Association for Computational Linguistics}
}