---
language: en
tags:
- text-classification
- gender
- gender-prediction
- transformers
- deberta
license: mit
datasets:
- samzirbo/europarl.en-es.gendered
- czyzi0/luna-speech-dataset
- czyzi0/pwr-azon-speech-dataset
- sagteam/author_profiling
- kaushalgawri/nptel-en-tags-and-gender-v0
metrics:
- accuracy
- f1
- precision
- recall
base_model: microsoft/deberta-v3-large
pipeline_tag: text-classification
model-index:
- name: gender_prediction_model_from_text
  results:
  - task:
      type: text-classification
      name: Text Classification
    metrics:
    - type: f1
      value: 0.69
    - type: accuracy
      value: 0.69
citations:
- "@misc{fc63_gender1_2025,\n title = {Gender Prediction from Text},\n author = {Çoban, Furkan},\n year = {2025},\n howpublished = {\\url{https://doi.org/10.5281/zenodo.15619489}},\n note = {DeBERTa-v3-large model fine-tuned on multi-domain gender-labeled texts}\n}"
---

# Gender Prediction from Text

This model **predicts** the likely **gender** of an anonymous speaker or writer based solely on the content of an English text. It is built on [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and fine-tuned on a multi-domain dataset of formal and informal texts, combining English sources with Polish and Russian sources translated into English.

**Space link**: [Try it out on Hugging Face Spaces](https://huggingface.co/spaces/fc63/Gender_Prediction)

**Model repo**: [View on Hugging Face Hub](https://huggingface.co/fc63/gender_prediction_model_from_text)

**Source code**: [GitHub](https://github.com/fc63/gender-classification)

---

## Model Summary

- **Base model**: `microsoft/deberta-v3-large`
- **Fine-tuned on**: binary gender classification task (`female` vs `male`)
- **Best F1 Score**: `0.69` on a balanced multi-domain test set
- **Max token length**: 128
- **Evaluation Metrics**:
  - F1: 0.69
  - Accuracy: 0.69
  - Precision: 0.69
  - Recall: 0.69

**Evaluation**: [View the notebook](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)

---

## Datasets Used

| Dataset | Domain | Language |
|---------|--------|----------|
| [samzirbo/europarl.en-es.gendered](https://huggingface.co/datasets/samzirbo/europarl.en-es.gendered) | Formal speech (Parliament) | English |
| [czyzi0/luna-speech-dataset](https://huggingface.co/datasets/czyzi0/luna-speech-dataset) | Phone conversations | Polish (translated to English) |
| [czyzi0/pwr-azon-speech-dataset](https://huggingface.co/datasets/czyzi0/pwr-azon-speech-dataset) | Phone conversations | Polish (translated to English) |
| [sagteam/author_profiling](https://huggingface.co/datasets/sagteam/author_profiling) | Social posts | Russian (translated to English) |
| [kaushalgawri/nptel-en-tags-and-gender-v0](https://huggingface.co/datasets/kaushalgawri/nptel-en-tags-and-gender-v0) | Spoken transcripts | English |
| [Blog Authorship Corpus](https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm) | Blog posts | English |

All datasets were normalized, translated where necessary, deduplicated, and **balanced via random undersampling** so that both genders are equally represented.
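
The normalize, deduplicate, and undersample steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the helper names `normalize`, `deduplicate`, and `undersample`, the `(text, label)` tuple layout, the specific cleaning rules, and the sample texts are all assumptions made for the example.

```python
import html
import random
import re
from collections import defaultdict

# Hypothetical mapping of curly quotes and long dashes to ASCII equivalents.
PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
    "\u2013": "-", "\u2014": "-",
})

def normalize(text):
    """Unescape HTML entities, drop tags, unify quotes/dashes, collapse whitespace."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.translate(PUNCT_MAP)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(samples):
    """Keep only the first occurrence of each text."""
    seen, unique = set(), []
    for text, label in samples:
        if text not in seen:
            seen.add(text)
            unique.append((text, label))
    return unique

def undersample(samples, seed=42):
    """Randomly drop samples so every label is as frequent as the rarest one."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced

# Made-up samples for illustration only.
raw = [
    ("<p>I\u2019ll be there</p>", "female"),
    ("I'll be there", "female"),  # becomes a duplicate after normalization
    ("Meeting moved to 3pm", "male"),
    ("Send the report today", "male"),
    ("Call me &amp; confirm", "male"),
]
cleaned = [(normalize(text), label) for text, label in raw]
balanced = undersample(deduplicate(cleaned))
```

Undersampling discards majority-class data rather than duplicating minority-class data, so the balanced set is smaller than the input but contains no repeated samples.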

---

## Preprocessing & Training

- **Normalization**: Cleaned quotes, dashes, placeholders, noise, and HTML/code from all datasets.
- **Translation**: Used `Helsinki-NLP/opus-mt-*` models to translate the Polish and Russian data into English.
- **Undersampling**: Randomly undersampled the larger class to balance male and female samples.
- **Training Strategy**:
  - Learning-rate finder used to select the learning rate (`2.66e-6`)
  - Fine-tuned with early stopping on both F1 and loss
  - Step-based evaluation every 250 steps
  - Best checkpoint (step 24,750) saved and evaluated
- **Second Phase Fine-tuning**:
  - Performed on the full merged dataset for 2 epochs
  - Used a cosine learning rate scheduler with warm-up steps
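
For intuition about a cosine schedule with warm-up, the sketch below computes a linear warm-up followed by cosine decay to zero. Only the `2.66e-6` figure comes from the list above, and whether the second phase reused that rate is not stated; the function name `lr_at_step`, the warm-up length, and the total step count are placeholder assumptions.

```python
import math

def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warm-up to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

PEAK_LR = 2.66e-6     # rate reported above; reuse in phase two is an assumption
WARMUP_STEPS = 500    # hypothetical
TOTAL_STEPS = 10_000  # hypothetical

# Print the rate at the start, mid warm-up, peak, mid decay, and end.
for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{lr_at_step(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS):.3e}")
```

The `transformers` library provides `get_cosine_schedule_with_warmup`, which follows the same shape; whether that helper was used for this model is not stated here.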

---

## Performance (on full merged test set)

| Class | Precision | Recall | F1-Score | Accuracy | Support |
|-------|-----------|--------|----------|----------|---------|
| Female | 0.70 | 0.65 | 0.68 | | 591,027 |
| Male | 0.68 | 0.72 | 0.70 | | 591,027 |
| **Macro Avg** | 0.69 | 0.69 | **0.69** | | 1,182,054 |
| **Accuracy** | | | | **0.69** | 1,182,054 |
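
Per-class and macro-averaged figures like those in the table follow from the confusion counts. The sketch below shows the arithmetic on a small made-up label list; the helper name `per_class_metrics` and the toy data are illustrative only and unrelated to the reported results.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, made up for illustration.
y_true = ["female", "female", "female", "female", "male", "male", "male", "male"]
y_pred = ["female", "female", "male", "male", "male", "male", "male", "female"]

scores = {label: per_class_metrics(y_true, y_pred, label) for label in ("female", "male")}

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

When both classes have equal support, as in the test set above, the macro average and the support-weighted average coincide.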

---

## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Use a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "fc63/gender_prediction_model_from_text"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to(device)

def predict(text):
    # Truncate to the model's 128-token maximum length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    confidence = round(probs[0][pred].item() * 100, 1)
    # Class index 0 is "Female"; index 1 is "Male".
    gender = "Female" if pred == 0 else "Male"
    return f"{gender} (Confidence: {confidence}%)"
```

```python
sample_text = "I love writing in my journal every night. It helps me reflect on the day and plan for tomorrow."
print(predict(sample_text))
```

Output for this sample:

```
Female (Confidence: 84.1%)
```

---

## Future Work & Limitations

I do not intend to leave this model at 0.69 accuracy and F1.

As far as I can tell at this point, the model is biased towards predicting emotional, psychological, and introspective texts as female, while more direct and result-oriented writing is often predicted as male. Correcting this requires a large, carefully labeled dataset that contains counterexamples to this pattern.

The datasets used to train this model had to come from open-source platforms, which limited the range of accessible data.

To make further progress, I need to create and label a larger dataset myself, which requires a significant amount of time, effort, and cost.

Before moving to dataset creation, I plan to try a few more approaches with the current dataset. So far, alternative techniques have not improved the scores without causing overfitting. If none of the remaining methods work, the only step left will be building a new dataset, and that will likely be the point where I stop development, as it would be both labor-intensive and costly for me.

---

## Author & License

**Author**: Furkan Çoban

**Project**: CENG-481 Gender Prediction Model

**License**: MIT