Personality Trait Predictor (Big Five)
This repository provides a machine learning pipeline for predicting the Big Five personality traits (OCEAN) from text input. It combines DistilBERT embeddings, LIWC-style linguistic features, and a set of Random Forest classifiers (one per trait) trained on annotated personality data.
Predicted traits:
- Openness
- Conscientiousness
- Extraversion
- Agreeableness
- Emotional Stability
Each trait is predicted as a categorical label: `low`, `medium`, or `high`.
How It Works
- Training was performed on the PANDORA dataset.
- Embeddings are extracted with the pretrained model `distilbert/distilbert-base-cased-distilled-squad`.
- 64 LIWC-style features are extracted using a mapping dictionary (`output.dic`).
- Both feature sets are concatenated and passed to a trait-specific Random Forest classifier.
- Predictions are returned as string labels for all five traits.
- There are five separate Random Forests, one per personality trait, each optimized with its own hyperparameters so that no label is systematically favored.
- Hyperparameter optimization targeted both accuracy and macro F1-score, checking that all three labels are actually predicted (unbiased).
- Each classifier uses its own `n_estimators` and `max_depth` values.
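As a rough sketch of this setup, the snippet below trains one Random Forest per trait on made-up data. The feature dimensions and the per-trait hyperparameter values here are illustrative assumptions, not the tuned values shipped in this repo:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical combined feature matrix: 768 DistilBERT dims + 64 LIWC dims.
X = rng.normal(size=(120, 768 + 64))
y = rng.choice(["low", "medium", "high"], size=120)

# Illustrative per-trait hyperparameters (not the repo's tuned values).
trait_params = {
    "Openness": {"n_estimators": 200, "max_depth": 10},
    "Conscientiousness": {"n_estimators": 300, "max_depth": 12},
    "Extraversion": {"n_estimators": 150, "max_depth": 8},
    "Agreeableness": {"n_estimators": 250, "max_depth": 10},
    "Emotional Stability": {"n_estimators": 200, "max_depth": 14},
}

# One independently configured classifier per trait.
classifiers = {
    trait: RandomForestClassifier(random_state=42, **params).fit(X, y)
    for trait, params in trait_params.items()
}

sample = X[:1]
predictions = {trait: clf.predict(sample)[0] for trait, clf in classifiers.items()}
print(predictions)
```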
Example Usage
```python
from personality_model import PersonalityClassifier

model = PersonalityClassifier()

text = "I enjoy solving challenging problems and thinking about philosophical questions."
predictions = model.predict_all_traits(text)
print(predictions)
# Output:
# {
#     'Openness': 'high',
#     'Conscientiousness': 'medium',
#     'Extraversion': 'low',
#     'Agreeableness': 'medium',
#     'Emotional stability': 'low'
# }
```
How to use:
After cloning the project, you can run two files to get predictions:
- For a demo, run `test_personality_model.py`. A text passage inside the script is used to generate predictions.
- For structured data containing text, use `predict_from_csv.ipynb`. Change the input and output paths to point at your test data. The script extracts the texts under the `Q1`, `Q2`, and `Q3` columns, concatenates them, and passes the full text to `PersonalityClassifier()`. It then writes (or replaces) the model predictions for the five personality traits; all other columns of the CSV file are left untouched. The output is the same CSV file with labels for the OCEAN traits.
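The notebook's flow can be sketched roughly as follows. `predict_all_traits` here is a toy stand-in for the real `PersonalityClassifier` (which lives in `personality_model.py`), and the data is made up; only the column handling mirrors the description above:

```python
import pandas as pd

# Toy stand-in for PersonalityClassifier.predict_all_traits.
def predict_all_traits(text):
    traits = ["Openness", "Conscientiousness", "Extraversion",
              "Agreeableness", "Emotional stability"]
    return {t: "medium" for t in traits}

df = pd.DataFrame({
    "id": [1, 2],
    "Q1": ["I like puzzles.", "I prefer routine."],
    "Q2": ["I plan ahead.", "I improvise."],
    "Q3": ["Crowds energize me.", "I value quiet."],
})

# Concatenate the three answer columns into one passage per row.
df["full_text"] = df[["Q1", "Q2", "Q3"]].agg(" ".join, axis=1)

# Attach one label column per trait; all other columns stay untouched.
preds = df["full_text"].apply(predict_all_traits)
for trait in preds.iloc[0]:
    df[trait] = preds.apply(lambda p: p[trait])

print(df.columns.tolist())
```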
Installation
We recommend creating a conda environment. Clone the repository and install dependencies:
```bash
git clone https://huggingface.co/Arash-Alborz/personality-trait-predictor
cd personality-trait-predictor

# Create a conda environment
conda create -n personality_env python=3.9
conda activate personality_env

# Install dependencies
pip install -r requirements.txt
```
Project Structure
```
personality-trait-predictor/
├── personality_model.py                  # Main class for prediction
├── requirements.txt
├── README.md
├── .gitignore
├── .gitattributes
├── models/
│   ├── feature_scaler.pkl                # StandardScaler for feature scaling
│   ├── output.dic                        # LIWC-style dictionary
│   ├── openness_classifier.pkl           # Classifiers ...
│   ├── conscientiousness_classifier.pkl
│   ├── extraversion_classifier.pkl
│   ├── agreeableness_classifier.pkl
│   └── emotional_stability_classifier.pkl
└── feature_extraction/
    ├── __init__.py
    ├── embedding_from_text.py            # Embedding extraction with BERT
    └── liwc_from_text.py                 # LIWC feature extraction
```
Model Details
- Embeddings: DistilBERT (CLS token from `distilbert-base-cased-distilled-squad`)
- Linguistic features: word count vectors from a custom LIWC dictionary
- Classifier: one `RandomForestClassifier` per trait, tuned with custom hyperparameters
- Scaling: features are scaled using `StandardScaler`
- Labels: traits are categorized into `low`, `medium`, or `high`
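A minimal sketch of LIWC-style word counting, using a tiny hypothetical dictionary (the repo's real 64-category mapping lives in `models/output.dic`; this version uses exact-match tokens for simplicity):

```python
import re
from collections import Counter

# Hypothetical three-category mini dictionary for illustration.
liwc_dict = {
    "posemo": {"enjoy", "happy", "love"},
    "cogproc": {"think", "because", "know"},
    "social": {"friend", "talk", "we"},
}

def liwc_counts(text):
    """Count dictionary hits per category, returned in fixed category order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for category, words in liwc_dict.items():
            if tok in words:
                counts[category] += 1
    return [counts[c] for c in sorted(liwc_dict)]

print(liwc_counts("I enjoy thinking, because we love to talk."))
# → [1, 2, 2]  (cogproc, posemo, social)
```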
Training & Evaluation
- Each trait classifier was trained on a labeled dataset using combined BERT+LIWC features.
- Validation was performed on a separate set simulating job interview answers.
- Random Forest hyperparameters (e.g., `n_estimators`, `max_depth`) were manually optimized per trait for the best F1-score.
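The two selection metrics can be computed with scikit-learn; the labels below are made up for illustration:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical validation labels for one trait.
y_true = ["low", "medium", "high", "medium", "low", "high"]
y_pred = ["low", "medium", "medium", "medium", "low", "high"]

acc = accuracy_score(y_true, y_pred)
# Macro averaging weights all three labels equally, which penalizes
# a classifier that ignores a minority class.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={acc:.2f}, macro_f1={macro_f1:.2f}")  # accuracy=0.83, macro_f1=0.82
```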
For more information about evaluation, visit our project's GitHub page.
Evaluation Results of the Final Optimized Model (Validation Set)

| Trait | Accuracy | Macro F1-score |
|---|---|---|
| Openness | 0.62 | 0.47 |
| Conscientiousness | 0.62 | 0.48 |
| Extraversion | 0.47 | 0.44 |
| Agreeableness | 0.38 | 0.36 |
| Emotional stability | 0.53 | 0.47 |
Notes
- The model does not use Hugging Face's `pipeline()` interface because it integrates custom feature engineering steps.
- You can import `PersonalityClassifier` directly to use the model.
Requirements
Dependencies include:
- numpy
- pandas
- scikit-learn
- torch
- transformers
- joblib
- tqdm
Author
University of Antwerp, "Digital Text Analysis", AMIV-NLP-2025
License
This project is licensed under the MIT License.