---
license: mit
tags:
- text-classification
- multitask
- toxicity
- misandry
- misogyny
- offensive-language
- english
- slovak
- xlm-roberta
---

# 🛡️ LexiGuard: Misogyny, Misandry & Toxicity Detection in English and Slovak

**LexiGuard** is a multilingual multitask model designed to detect and classify offensive language, with a focus on **misogyny**, **misandry**, and **toxicity levels** in **English**. The model also supports **Slovak**, making it suitable for multilingual analysis of social media content.

It performs **dual classification**:

1. **Category**: Misogyny, Misandry, or Neutral
2. **Toxicity level**: Low, Medium, or High

The model is based on [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) and was fine-tuned on a custom dataset primarily in **English**, with additional annotated samples in **Slovak**.

---

## 🧠 Model Overview

- **Base model**: `xlm-roberta-base`
- **Tasks**: Multitask classification (2 output heads)
- **Primary language**: English
- **Secondary language**: Slovak
- **Use case**: Detecting offensive, sexist, or toxic comments in multilingual social media

---

## 🛠️ Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Megyy/lexiguard")
model = AutoModelForSequenceClassification.from_pretrained("Megyy/lexiguard")

text = "Women are useless in politics."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
# outputs.logits contains predictions for both tasks
```

> Note: The model has **two output heads**:
> - Head 1: Category (misogyny/misandry/neutral)
> - Head 2: Toxicity (low/medium/high)

---

## 📊 Label Definitions

**Task 1 – Category Classification**
- `0`: Neutral
- `1`: Misogyny
- `2`: Misandry

**Task 2 – Toxicity Prediction**
- `0`: Low
- `1`: Medium
- `2`: High

---

## 🧪 Training Data

- Over 5,000 manually annotated comments
- Domain: online discussions, social media, and forums
- Language distribution:
  - ~80% English
  - ~20% Slovak

---

## 📁 Model Files

- `pytorch_model.bin` / `model.safetensors`: model weights
- `config.json`: model configuration
- `tokenizer.json`, `vocab.txt`, etc.: tokenizer files
- `README.md`: model card

---

## 📚 Citation

If you use this model in your work, please cite:

```
@bachelorsthesis{majercakova2025lexiguard,
  title={LexiGuard: Offensive Language Detection in English and Slovak Social Media},
  author={Magdalena Majercakova},
  year={2025},
  note={Bachelor's thesis, TUKE},
}
```

---

## 👨‍💻 Author

Developed by **Magdaléna Majerčáková** as part of a Bachelor's thesis.
Supervised by **Ing. Zuzana Sokolová, PhD.**
Faculty of Electrical Engineering and Informatics, TUKE (2025)

---
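## 🔎 Decoding Example

The label IDs from the definitions above can be mapped back to human-readable predictions. The sketch below is a minimal, model-free illustration: it **assumes** the combined logits come back as a flat 6-value vector with the three category logits first and the three toxicity logits second. This head layout is an assumption, not a documented guarantee — check `config.json` for the actual output arrangement. The helper name `decode_logits` and the example logits are hypothetical.

```python
# Hedged sketch: split a flat 6-value logits vector into the two task labels.
# ASSUMPTION: values [0:3] belong to the category head, [3:6] to the toxicity
# head; verify this against the model's config.json before relying on it.

CATEGORY_LABELS = {0: "neutral", 1: "misogyny", 2: "misandry"}
TOXICITY_LABELS = {0: "low", 1: "medium", 2: "high"}

def decode_logits(logits):
    """Return (category, toxicity) labels via per-head argmax."""
    cat_logits, tox_logits = logits[:3], logits[3:]
    category = CATEGORY_LABELS[max(range(3), key=lambda i: cat_logits[i])]
    toxicity = TOXICITY_LABELS[max(range(3), key=lambda i: tox_logits[i])]
    return category, toxicity

# Made-up logits for illustration (not real model output):
print(decode_logits([0.1, 2.3, -0.5, 1.8, 0.2, -1.0]))  # ('misogyny', 'low')
```

In real use, the input list would be replaced by `outputs.logits[0].tolist()` from the usage snippet above.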