---
language:
- sk
license: apache-2.0
tags:
- hate-speech
- text-cleaning
- slovak
- t5
- transformer
metrics:
- accuracy
pipeline_tag: text2text-generation
---

# Model Card: SlovHC – Slovak Hate Speech Corrector

## Model Overview

**SlovHC** is a fine-tuned sequence-to-sequence model for correcting hate speech in Slovak. Robust language models for lower-resource languages such as Slovak remain scarce, and SlovHC aims to fill this gap by identifying offensive spans and masking them while leaving the rest of the sentence intact.

## Key Features

* Tailored for the Slovak language
* Masks hate speech while preserving the original sentence structure
* Uses the pre-trained SlovakBERT tokenizer for consistent tokenization

## Example Outputs

**Input:** Ty si absolútny magor.

**Output:** Ty si absolútny \*\*\*\*\*.

---

**Input:** Priblblé električky stále meškajú.

**Output:** \*\*\*\*\*\*\*\*\* električky stále meškajú.

---

**Input:** Opač jak ši sebe obľik tote nohavky, ši jak mantak.

**Output:** Opač jak ši sebe obľik tote nohavky, ši jak \*\*\*\*\*\*.

## Tokenizer

We did not train a new tokenizer for this model. Instead, we reuse the high-quality tokenizer of [`gerulata/slovakbert`](https://huggingface.co/gerulata/slovakbert), which covers Slovak well and matches the model's requirements.

## How to Use

Here is a simple example that loads the tokenizer and model and runs inference:

```python
from transformers import RobertaTokenizer, AutoModelForSeq2SeqLM

# Load the pre-trained SlovakBERT tokenizer and the fine-tuned model.
# Pass token="###YOUR_HF_TOKEN###" to from_pretrained only if you need
# authenticated access (e.g. a private or gated repository).
tokenizer = RobertaTokenizer.from_pretrained("gerulata/slovakbert")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "timotejKralik/hate_speech_correction_slovak"
)

# Input text containing potentially harmful language
input_text = "Opač jak ši sebe obľik tote nohavky, ši jak mantak."
print(f"Input: {input_text}")

# Tokenize the input and generate the masked output
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Output:", output_text)
```
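For processing several sentences at once, the generation call can be wrapped in a small batching helper. The sketch below is illustrative only: the `mask_hate_speech` function, the padding/truncation settings, and the `max_new_tokens` value are our own assumptions, not part of the released model.

```python
from transformers import RobertaTokenizer, AutoModelForSeq2SeqLM

tokenizer = RobertaTokenizer.from_pretrained("gerulata/slovakbert")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "timotejKralik/hate_speech_correction_slovak"
)

def mask_hate_speech(sentences):
    """Mask offensive spans in a list of Slovak sentences.

    Hypothetical convenience wrapper around the usage example above;
    batching and generation settings are illustrative assumptions.
    """
    inputs = tokenizer(
        sentences,
        return_tensors="pt",
        padding=True,      # pad so sentences of different lengths share a batch
        truncation=True,
    )
    outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

examples = [
    "Ty si absolútny magor.",
    "Priblblé električky stále meškajú.",
]
for original, masked in zip(examples, mask_hate_speech(examples)):
    print(original, "->", masked)
```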