---
language: en
tags:
- text-classification
- gender
- gender-prediction
- transformers
- deberta
license: mit
datasets:
- samzirbo/europarl.en-es.gendered
- czyzi0/luna-speech-dataset
- czyzi0/pwr-azon-speech-dataset
- sagteam/author_profiling
- kaushalgawri/nptel-en-tags-and-gender-v0
metrics:
- accuracy
- f1
- precision
- recall
base_model: microsoft/deberta-v3-large
pipeline_tag: text-classification
model-index:
- name: gender_prediction_model_from_text
  results:
  - task:
      type: text-classification
      name: Text Classification
    metrics:
    - type: f1
      value: 0.69
    - type: accuracy
      value: 0.69
citations:
- "@misc{fc63_gender1_2025,\n title = {Gender Prediction from Text},\n author = {Γoban, Furkan},\n year = {2025},\n howpublished = {\\url{https://doi.org/10.5281/zenodo.15619489}},\n note = {DeBERTa-v3-large model fine-tuned on multi-domain gender-labeled texts}\n}"
---
# Gender Prediction from Text ✍️ → 👩‍🦰👨
This model **predicts** the likely **gender** of an anonymous speaker or writer based solely on the content of an English text. It is built on [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and fine-tuned on a diverse, multi-domain dataset of formal and informal texts, drawn from several source languages and translated into English where necessary.
🔗 **Space link**: [🚀 Try it out on Hugging Face Spaces](https://huggingface.co/spaces/fc63/Gender_Prediction)
🔗 **Model repo**: [🔍 View on Hugging Face Hub](https://huggingface.co/fc63/gender_prediction_model_from_text)
🧠 **Source code**: [GitHub](https://github.com/fc63/gender-classification)
---
## 📌 Model Summary
- **Base model**: `microsoft/deberta-v3-large`
- **Fine-tuned for**: binary gender classification (`female` vs `male`)
- **Best F1 Score**: `0.69` on a balanced multi-domain test set
- **Max token length**: 128
- **Evaluation Metrics**:
- F1: 0.69
- Accuracy: 0.69
- Precision: 0.69
- Recall: 0.69
📊 **Evaluation**: [View the evaluation notebook](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)
---
## 🧾 Datasets Used
| Dataset | Domain | Language |
|---------|--------|----------|
| [samzirbo/europarl.en-es.gendered](https://huggingface.co/datasets/samzirbo/europarl.en-es.gendered) | Formal speech (Parliament) | English |
| [czyzi0/luna-speech-dataset](https://huggingface.co/datasets/czyzi0/luna-speech-dataset) | Phone conversations | Polish (translated to English) |
| [czyzi0/pwr-azon-speech-dataset](https://huggingface.co/datasets/czyzi0/pwr-azon-speech-dataset) | Phone conversations | Polish (translated to English) |
| [sagteam/author_profiling](https://huggingface.co/datasets/sagteam/author_profiling) | Social media posts | Russian (translated to English) |
| [kaushalgawri/nptel-en-tags-and-gender-v0](https://huggingface.co/datasets/kaushalgawri/nptel-en-tags-and-gender-v0) | Spoken transcripts | English |
| [Blog Authorship Corpus](https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm) | Blog posts | English |
All datasets were normalized, translated if necessary, deduplicated, and **balanced via random undersampling** to ensure equal representation of both genders.
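For illustration only, here is a minimal sketch of the translation and balancing steps, assuming a pandas DataFrame with hypothetical `text` and `gender` columns (the actual preprocessing scripts live in the GitHub repo):
```python
# Minimal preprocessing sketch; column names "text" / "gender" are assumptions.
import pandas as pd
from transformers import pipeline

# The Polish/Russian subsets were translated with Helsinki-NLP opus-mt models
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-pl-en")

def translate_texts(texts):
    # Translate a list of Polish sentences to English
    return [out["translation_text"] for out in translator(texts)]

def balance_by_undersampling(df, label_col="gender", seed=42):
    # Drop exact duplicates, then randomly undersample the majority class
    # so both genders are equally represented
    df = df.drop_duplicates(subset="text")
    n = df[label_col].value_counts().min()
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(n=n, random_state=seed))
          .reset_index(drop=True)
    )
```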
---
## 🛠️ Preprocessing & Training
- **Normalization**: Cleaned quotes, dashes, placeholders, noise, and HTML/code from all datasets.
- **Translation**: Used `Helsinki-NLP/opus-mt-*` models for Polish and Russian data.
- **Undersampling**: Random undersampling to balance male and female samples.
- **Training Strategy** (see the configuration sketch after this list):
- LR Finder used to optimize learning rate (`2.66e-6`)
- Fine-tuned using early stopping on both F1 and loss
- Step-based evaluation every 250 steps
- Best checkpoint at step 24,750 saved and evaluated
- **Second Phase Fine-tuning**:
- Performed on full merged dataset for 2 epochs
- Used cosine learning rate scheduler and warm-up steps
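The exact training scripts are in the GitHub repo; the sketch below only illustrates the configuration described above, i.e. the LR-Finder learning rate, step-based evaluation every 250 steps, early stopping on F1, and the second phase's cosine schedule with warm-up. `train_ds`, `eval_ds`, the warm-up step count, and the patience value are placeholders:
```python
# Illustrative Hugging Face Trainer setup (a sketch, not the exact script).
import numpy as np
from sklearn.metrics import f1_score
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

def compute_metrics(eval_pred):
    # Macro F1, so checkpoint selection can track the "f1" key below
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=2.66e-6,          # value chosen with the LR Finder
    eval_strategy="steps",          # "evaluation_strategy" in older transformers
    eval_steps=250,                 # step-based evaluation every 250 steps
    save_strategy="steps",
    save_steps=250,
    load_best_model_at_end=True,    # keeps the best checkpoint (e.g. step 24,750)
    metric_for_best_model="f1",
    lr_scheduler_type="cosine",     # second-phase schedule
    warmup_steps=500,               # placeholder; actual value not documented here
    num_train_epochs=2,             # second phase ran for 2 epochs
)

trainer = Trainer(
    model=model,                    # the DeBERTa-v3-large classifier
    args=args,
    train_dataset=train_ds,         # placeholder dataset objects
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is a guess
)
# trainer.train()
```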
---
## 📊 Performance (on the full merged test set)
| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Female | 0.70 | 0.65 | 0.68 | 591,027 |
| Male | 0.68 | 0.72 | 0.70 | 591,027 |
| **Macro Avg** | 0.69 | 0.69 | **0.69** | 1,182,054 |

**Overall accuracy**: **0.69** (1,182,054 samples)
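A per-class report in this format can be generated with scikit-learn; here is a small self-contained sketch with made-up labels (in practice `y_true` and `y_pred` come from running the model over the merged test set):
```python
from sklearn.metrics import classification_report

# Hypothetical label arrays (0 = female, 1 = male); the real ones come from
# scoring the 1,182,054-sample merged test set.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(classification_report(y_true, y_pred, target_names=["Female", "Male"], digits=2))
```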
---
## 📦 Usage Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "fc63/gender_prediction_model_from_text"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to(device)
def predict(text):
    # Tokenize with the same 128-token limit used during training
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    confidence = round(probs[0][pred].item() * 100, 1)
    gender = "Female" if pred == 0 else "Male"  # label 0 = female, 1 = male
    return f"{gender} (Confidence: {confidence}%)"
```
```python
sample_text = "I love writing in my journal every night. It helps me reflect on the day and plan for tomorrow."
print(predict(sample_text))
```
Output for this sample:
```
Female (Confidence: 84.1%)
```
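For scoring many texts, batching is usually faster than calling `predict` in a loop. Here is a minimal sketch reusing the `tokenizer`, `model`, and `device` loaded above (the batch size is an arbitrary choice):
```python
def predict_batch(texts, batch_size=32):
    # Tokenize and classify texts in batches; returns "Female"/"Male" per text
    labels = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           padding=True, max_length=128).to(device)
        with torch.no_grad():
            probs = F.softmax(model(**inputs).logits, dim=1)
        preds = torch.argmax(probs, dim=1).tolist()
        labels.extend("Female" if p == 0 else "Male" for p in preds)
    return labels
```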
---
## 🔭 Future Work & Limitations
I do not intend to leave this model at 0.69 accuracy and F1.
The clearest bias I have identified so far is that emotional, psychological, and introspective texts tend to be predicted as female, while more direct, result-oriented writing tends to be predicted as male. Counteracting this pattern will require a large, carefully labeled dataset that does not exhibit it.
The training data had to be obtained from open-source platforms, which limited the range of accessible data.
To make further progress, I would need to create and label a larger dataset myself, which requires significant time, effort, and cost.
Before moving to dataset creation, I plan to try a few more approaches with the current data. So far, alternative techniques have not improved the scores without causing overfitting. If the remaining methods also fail, the only step left will be building a new dataset, and that will likely be the point where I stop development, since it would be both labor-intensive and costly for me.
---
## 👨‍🔬 Author & License
**Author**: Furkan Çoban
**Project**: CENG-481 Gender Prediction Model
**License**: MIT