---
language: en
tags:
  - text-classification
  - gender
  - gender-prediction
  - transformers
  - deberta
license: mit
datasets:
  - samzirbo/europarl.en-es.gendered
  - czyzi0/luna-speech-dataset
  - czyzi0/pwr-azon-speech-dataset
  - sagteam/author_profiling
  - kaushalgawri/nptel-en-tags-and-gender-v0
metrics:
  - accuracy
  - f1
  - precision
  - recall
base_model: microsoft/deberta-v3-large
pipeline_tag: text-classification
model-index:
  - name: gender_prediction_model_from_text
    results:
      - task:
          type: text-classification
          name: Text Classification
        metrics:
          - type: f1
            value: 0.69
          - type: accuracy
            value: 0.69
citations:
  - "@misc{fc63_gender1_2025,\n  title = {Gender Prediction from Text},\n  author = {Çoban, Furkan},\n  year = {2025},\n  howpublished = {\\url{https://doi.org/10.5281/zenodo.15619489}},\n  note = {DeBERTa-v3-large model fine-tuned on multi-domain gender-labeled texts}\n}"
---


# Gender Prediction from Text ✍️ → 👩‍🦰👨

This model **predicts** the likely **gender** of an anonymous speaker or writer based solely on the content of an English text. It is built on [DeBERTa-v3-large](https://huggingface.co/microsoft/deberta-v3-large) and fine-tuned on a diverse, multi-domain dataset of formal and informal texts, some of which were translated into English from Polish and Russian.

📍 **Space link**: [🔗 Try it out on Hugging Face Spaces](https://huggingface.co/spaces/fc63/Gender_Prediction)  
📍 **Model repo**: [🔗 View on Hugging Face Hub](https://huggingface.co/fc63/gender_prediction_model_from_text)  
🧠 **Source code**: [GitHub](https://github.com/fc63/gender-classification)

---

## 📊 Model Summary

- **Base model**: `microsoft/deberta-v3-large`
- **Fine-tuned on**: binary gender classification task (`female` vs `male`)
- **Best F1 Score**: `0.69` on a balanced multi-domain test set
- **Max token length**: 128
- **Evaluation Metrics**:
  - F1: 0.69
  - Accuracy: 0.69
  - Precision: 0.69
  - Recall: 0.69

📂 **Evaluation**: [View the notebook](https://github.com/fc63/gender-classification/blob/main/Evaluate/modelv3.ipynb)

---

## 🧾 Datasets Used

| Dataset | Domain | Language |
|--------|--------|------|
| [samzirbo/europarl.en-es.gendered](https://huggingface.co/datasets/samzirbo/europarl.en-es.gendered) | Formal speech (Parliament) | English |
| [czyzi0/luna-speech-dataset](https://huggingface.co/datasets/czyzi0/luna-speech-dataset) | Phone conversations | Polish → English (translated) |
| [czyzi0/pwr-azon-speech-dataset](https://huggingface.co/datasets/czyzi0/pwr-azon-speech-dataset) | Phone conversations | Polish → English (translated) |
| [sagteam/author_profiling](https://huggingface.co/datasets/sagteam/author_profiling) | Social posts | Russian → English (translated) |
| [kaushalgawri/nptel-en-tags-and-gender-v0](https://huggingface.co/datasets/kaushalgawri/nptel-en-tags-and-gender-v0) | Spoken transcripts | English |
| [Blog Authorship Corpus](https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm) | Blog posts | English |

All datasets were normalized, translated if necessary, deduplicated, and **balanced via random undersampling** to ensure equal representation of both genders.
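
The balancing step can be illustrated with a minimal sketch of random undersampling. It uses only the Python standard library; the `undersample` helper, the `gender` key, the seed, and the toy records are assumptions made for this example and are not taken from the project's code.

```python
import random
from collections import Counter, defaultdict

def undersample(records, label_key="gender", seed=42):
    """Randomly drop rows from larger classes until every class
    has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)
    # The smallest class sets the per-class target size.
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid long runs of a single label
    return balanced

# Toy data (illustrative only): 5 "male" rows and 2 "female" rows.
records = [{"text": f"m{i}", "gender": "male"} for i in range(5)]
records += [{"text": f"f{i}", "gender": "female"} for i in range(2)]

balanced = undersample(records)
counts = Counter(rec["gender"] for rec in balanced)
```

After the call, both classes hold the same number of rows as the smaller class had originally, which is why undersampling trades data volume for balance.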

---

## 🛠️ Preprocessing & Training

- **Normalization**: Cleaned quotes, dashes, placeholders, noise, and HTML/code from all datasets.
- **Translation**: Used `Helsinki-NLP/opus-mt-*` models for Polish and Russian data.
- **Undersampling**: Random undersampling to balance male and female samples.
- **Training Strategy**:
  - Learning rate (`2.66e-6`) selected with an LR finder
  - Fine-tuned using early stopping on both F1 and loss
  - Step-based evaluation every 250 steps
  - Best checkpoint at step 24,750 saved and evaluated
- **Second Phase Fine-tuning**:
  - Performed on full merged dataset for 2 epochs
  - Used cosine learning rate scheduler and warm-up steps
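
For readers unfamiliar with the schedule named in the second phase, here is a small sketch of a cosine learning rate schedule with linear warm-up, written in plain Python. The step counts are hypothetical and chosen only to show the shape; the card does not state the warm-up length or which implementation was used. Hugging Face Transformers ships `get_cosine_schedule_with_warmup`, which produces the same kind of curve.

```python
import math

def cosine_with_warmup(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warm-up from 0 to 1.
        return step / max(1, warmup_steps)
    # Cosine decay from 1 to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Hypothetical step counts, not values from the actual training run.
WARMUP_STEPS = 100
TOTAL_STEPS = 1_000

multipliers = {
    step: cosine_with_warmup(step, WARMUP_STEPS, TOTAL_STEPS)
    for step in (0, 50, 100, 550, 1_000)
}
```

The multiplier rises linearly during warm-up, peaks at 1 when warm-up ends, passes through one half midway through the decay, and reaches 0 at the final step.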

---

## 📈 Performance (on full merged test set)

| Class | Precision | Recall | F1-Score | Accuracy | Support |
|-----|-----|--------|----------|---------|---------|
| Female | 0.70 | 0.65 | 0.68 | | 591,027 |
| Male   | 0.68 | 0.72 | 0.70 | | 591,027 |
| **Macro Avg** | 0.69 | 0.69 | **0.69** | | 1,182,054 |
| **Accuracy**  |           |        | | **0.69** | 1,182,054 |
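
The summary rows follow from the per-class rows, and a short sketch makes the relationship explicit. The numbers below are copied from the table (already rounded to two decimals); the helper names are made up for this example. Because both classes have the same support, accuracy equals the plain mean of the two recall values.

```python
# Per-class figures copied from the table above (rounded to two decimals).
report = {
    "Female": {"precision": 0.70, "recall": 0.65, "f1": 0.68, "support": 591_027},
    "Male": {"precision": 0.68, "recall": 0.72, "f1": 0.70, "support": 591_027},
}

def macro_average(report, metric):
    """Unweighted mean of one metric over all classes."""
    return sum(row[metric] for row in report.values()) / len(report)

def accuracy_from_recall(report):
    """Accuracy is correct predictions over all samples,
    i.e. the support-weighted mean of per-class recall."""
    total = sum(row["support"] for row in report.values())
    correct = sum(row["recall"] * row["support"] for row in report.values())
    return correct / total

macro_precision = macro_average(report, "precision")
macro_f1 = macro_average(report, "f1")
accuracy = accuracy_from_recall(report)
```

Recomputed from the rounded inputs, macro precision and macro F1 both come out at 0.69 and accuracy at about 0.685, which is consistent with the reported 0.69 given the rounding of the inputs.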

---

## 📦 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "fc63/gender_prediction_model_from_text"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to(device)

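def predict(text):
    """Return the predicted gender label and its softmax confidence for one text."""
    # Truncate to 128 tokens, the maximum length the model was fine-tuned with.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
        probs = F.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    confidence = round(probs[0][pred].item() * 100, 1)
    # Label mapping: index 0 is female, index 1 is male.
    gender = "Female" if pred == 0 else "Male"
    return f"{gender} (Confidence: {confidence}%)"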
```
```python
sample_text = "I love writing in my journal every night. It helps me reflect on the day and plan for tomorrow."
print(predict(sample_text))
```
Output for this sample:
```
Female (Confidence: 84.1%)
```
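
Since the model is correct on roughly 69% of test samples, a caller may prefer to abstain on low-confidence predictions instead of always returning a label. The sketch below shows one way to do that on raw logits using only the standard library. The `label_or_abstain` helper, the `0.75` threshold, and the example logits are assumptions for illustration; the softmax score is also not guaranteed to be a calibrated probability.

```python
import math

LABELS = ("Female", "Male")  # index order used in the usage example above

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_or_abstain(logits, min_confidence=0.75):
    """Return (label, confidence); the label is "Uncertain" when the
    top softmax score falls below min_confidence (an arbitrary threshold)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence < min_confidence:
        return "Uncertain", confidence
    return LABELS[best], confidence

# Made-up logits, not real model outputs.
confident = label_or_abstain([1.7, 0.0])
borderline = label_or_abstain([0.2, 0.0])
```

With these inputs the first call keeps its label and the second abstains. In practice the logits would come from `outputs.logits[0].tolist()` in the `predict` function.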

---

## 📌 Future Work & Limitations


I do not want to leave this model at an accuracy and F1 score of 0.69.

As far as I can tell at this point, the model is biased toward predicting emotional, psychological, and introspective texts as female, and more direct, result-oriented writing as male. Correcting this would require a large, carefully labeled dataset that contains counterexamples to this pattern.

The datasets used to train this model had to be obtained from open-source platforms, which limited the range of accessible data.

To make further progress, I need to create and label a larger dataset myself, which requires a significant amount of time, effort, and cost.

Before moving to dataset creation, I plan to try a few more approaches using the current dataset. So far, alternative techniques have not improved the scores without causing overfitting. If none of the remaining methods work, the only step left will be building a new dataset, and that will likely be the point where I stop development, as it will be both labor-intensive and costly for me.

---

## 👨‍🔬 Author & License

**Author**: Furkan Çoban  
**Project**: CENG-481 Gender Prediction Model  
**License**: MIT