license: mit
tags:
- text-classification
- multitask
- toxicity
- misandry
- misogyny
- offensive-language
- english
- slovak
- xlm-roberta
🛡️ LexiGuard: Misogyny, Misandry & Toxicity Detection in English and Slovak
LexiGuard is a multilingual multitask model designed to detect and classify offensive language, with a focus on misogyny, misandry, and toxicity levels in English. The model also supports Slovak, making it suitable for multilingual analysis of social media content.
It performs dual classification:
- Category: Misogyny, Misandry, or Neutral
- Toxicity level: Low, Medium, or High
The model is based on xlm-roberta-base and was fine-tuned on a custom dataset primarily in English, with additional annotated samples in Slovak.
🧠 Model Overview
- Base model: xlm-roberta-base
- Tasks: Multitask classification (2 output heads)
- Primary language: English
- Secondary language: Slovak
- Use case: Detecting offensive, sexist, or toxic comments in multilingual social media
🛠️ Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and fine-tuned weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Megyy/lexiguard")
model = AutoModelForSequenceClassification.from_pretrained("Megyy/lexiguard")

text = "Women are useless in politics."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
# outputs.logits contains predictions for both tasks
Note: The model has two output heads:
- Head 1: Category (misogyny/misandry/neutral)
- Head 2: Toxicity (low/medium/high)
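This card does not document how the two heads are arranged inside `outputs.logits`. The sketch below shows one way the scores might be separated, assuming (and this is only an assumption) that each row is a flat vector of six values: three category logits followed by three toxicity logits. `split_heads` and `softmax` are hypothetical helpers written for illustration, and the example numbers are made up rather than taken from the model.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(logits, n_category=3, n_toxicity=3):
    """Split one flat logit row into per-head probabilities.

    ASSUMPTION: the row is laid out as
    [category logits..., toxicity logits...].
    The model card does not confirm this layout.
    """
    expected = n_category + n_toxicity
    if len(logits) != expected:
        raise ValueError(f"expected {expected} logits, got {len(logits)}")
    category_probs = softmax(logits[:n_category])
    toxicity_probs = softmax(logits[n_category:])
    return category_probs, toxicity_probs


# Made-up logits for illustration only, not real model output.
example_row = [0.1, 2.3, -1.0, -0.5, 0.2, 1.7]
category_probs, toxicity_probs = split_heads(example_row)
print(category_probs)
print(toxicity_probs)
```

If the layout assumption holds for the real model, a row could be obtained with `outputs.logits[0].tolist()`. If the logits have a different shape, the length check raises an error instead of silently mislabelling the scores.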
📊 Label Definitions
Task 1 – Category Classification
- 0: Neutral
- 1: Misogyny
- 2: Misandry
Task 2 – Toxicity Prediction
- 0: Low
- 1: Medium
- 2: High
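The label ids above can be turned into readable names with two small lookup tables. This is a minimal sketch: the dictionaries mirror the definitions in this card, while `decode_prediction` is a hypothetical helper name and not something shipped with the model.

```python
# Label ids as listed in this model card.
ID2CATEGORY = {0: "Neutral", 1: "Misogyny", 2: "Misandry"}
ID2TOXICITY = {0: "Low", 1: "Medium", 2: "High"}


def decode_prediction(category_id, toxicity_id):
    """Translate a pair of predicted class ids into label names.

    Hypothetical helper for illustration; not part of the model repo.
    """
    return {
        "category": ID2CATEGORY[category_id],
        "toxicity": ID2TOXICITY[toxicity_id],
    }


print(decode_prediction(1, 2))
# prints {'category': 'Misogyny', 'toxicity': 'High'}
```

Keeping the mappings in one place makes it harder to swap Misogyny (1) and Misandry (2) by accident when post-processing predictions.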
🧪 Training Data
- Over 5,000 manually annotated comments
- Domain: Online discussions, social media, and forums
- Language distribution:
  - ~80% English
  - ~20% Slovak
📁 Model Files
- pytorch_model.bin / model.safetensors: model weights
- config.json: model configuration
- tokenizer.json, vocab.txt, etc.: tokenizer files
- README.md: model card
📚 Citation
If you use this model in your work, please cite:
@bachelorsthesis{majercakova2025lexiguard,
title={LexiGuard: Offensive Language Detection in English and Slovak Social Media},
author={Magdalena Majercakova},
year={2025},
note={Bachelor's thesis, TUKE},
}
👨‍💻 Author
Developed by Magdaléna Majerčáková as part of a Bachelor's Thesis
Supervised by Ing. Zuzana Sokolová, PhD
Faculty of Electrical Engineering and Informatics, TUKE (2025)