Personality Trait Predictor (Big Five)

This repository provides a machine learning pipeline for predicting the Big Five personality traits (OCEAN) from text input. It combines DistilBERT embeddings, LIWC-style linguistic features, and a set of Random Forest classifiers (one for each trait) trained on annotated personality data.

Predicted traits:

  • Openness
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • Emotional Stability

Each trait is predicted as a categorical label: low, medium, or high.


How It Works

  • Training was done on the PANDORA dataset.
  • Embeddings are extracted using the pretrained model distilbert/distilbert-base-cased-distilled-squad.
  • 64 LIWC-style features are extracted using a word-to-category mapping dictionary (output.dic).
  • Both feature sets are concatenated and passed to a trait-specific Random Forest classifier.
  • Predictions are returned as string labels for all five traits.
  • There are five separate Random Forests, one for each personality trait, each tuned separately with its own hyperparameters to keep the prediction model fair and unbiased (a rough sketch follows this list).
  • Hyperparameter optimization was based on accuracy and F1-score, with the requirement that all three labels are actually predicted (i.e., no class is ignored).
  • Each Random Forest classifier therefore uses its own n_estimators and max_depth to reach its optimum.
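
A rough sketch of this per-trait prediction loop is shown below. It is a simplified illustration based on the files listed under models/, not the exact code in personality_model.py; the feature ordering and loading logic may differ in the real implementation.

import joblib
import numpy as np

# Trait names assumed to match the classifier filenames under models/
TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "emotional_stability"]

def predict_traits(embedding: np.ndarray, liwc_features: np.ndarray) -> dict:
    """Concatenate a 768-dim DistilBERT CLS embedding with the 64 LIWC-style
    features, scale them, and run each trait-specific Random Forest."""
    features = np.concatenate([embedding, liwc_features]).reshape(1, -1)

    scaler = joblib.load("models/feature_scaler.pkl")
    features = scaler.transform(features)

    return {
        trait: joblib.load(f"models/{trait}_classifier.pkl").predict(features)[0]
        for trait in TRAITS  # each prediction is 'low', 'medium' or 'high'
    }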

Example Usage

from personality_model import PersonalityClassifier

model = PersonalityClassifier()

text = "I enjoy solving challenging problems and thinking about philosophical questions."
predictions = model.predict_all_traits(text)

print(predictions)
# Output:
# {
#   'Openness': 'high',
#   'Conscientiousness': 'medium',
#   'Extraversion': 'low',
#   'Agreeableness': 'medium',
#   'Emotional stability': 'low'
# }

How to Use

After cloning the project, you can run two files to get the predictions:

  • For a demo, run test_personality_model.py. A text passage provided inside the script will be used to generate predictions.

  • For structured data containing text, use predict_from_csv.ipynb. Change the input and output paths so they point to your test data. The notebook extracts the texts under the Q1, Q2 and Q3 columns, concatenates them, and passes the full text to PersonalityClassifier(). It then writes (or overwrites) the model predictions for the five traits, while all other columns of the CSV file are left untouched. The output is the same CSV file with labels for the OCEAN traits (see the sketch below).
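
The notebook's logic is roughly equivalent to the sketch below. The file paths are placeholders, and the output column names are assumed to follow the keys returned by predict_all_traits().

import pandas as pd
from personality_model import PersonalityClassifier

model = PersonalityClassifier()

# Placeholder paths -- point these at your own input/output files
df = pd.read_csv("data/test_data.csv")

for idx, row in df.iterrows():
    # Concatenate the three answer columns into one text passage
    text = " ".join(str(row[col]) for col in ("Q1", "Q2", "Q3"))
    predictions = model.predict_all_traits(text)
    # Write (or overwrite) one column per trait; all other columns stay untouched
    for trait, label in predictions.items():
        df.loc[idx, trait] = label

df.to_csv("data/test_data_with_predictions.csv", index=False)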

Installation

We recommend creating a conda environment. Clone the repository and install dependencies:

git clone https://huggingface.co/Arash-Alborz/personality-trait-predictor
cd personality-trait-predictor

# Create a conda environment
conda create -n personality_env python=3.9
conda activate personality_env

# Install dependencies
pip install -r requirements.txt

Project Structure

personality-trait-predictor/
├── personality_model.py              # Main class for prediction
├── requirements.txt
├── README.md
├── .gitignore
├── .gitattributes
├── models/
│   ├── feature_scaler.pkl                      # StandardScaler for feature scaling
│   ├── output.dic                              # LIWC-style dictionary
│   ├── openness_classifier.pkl                 # Classifiers ...
│   ├── conscientiousness_classifier.pkl
│   ├── extraversion_classifier.pkl
│   ├── agreeableness_classifier.pkl
│   └── emotional_stability_classifier.pkl
└── feature_extraction/
    ├── __init__.py
    ├── embedding_from_text.py        # Embedding extraction with DistilBERT
    └── liwc_from_text.py             # LIWC feature extraction

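The LIWC-style features come from feature_extraction/liwc_from_text.py together with models/output.dic. The sketch below shows what this kind of dictionary-based counting typically looks like; it assumes a standard LIWC .dic layout (category ids between '%' lines, then word-to-category mappings with '*' as a prefix wildcard), which may differ from the actual format of output.dic.

import re
from collections import Counter

def load_liwc_dic(path: str):
    """Parse a LIWC-style .dic file (assumed layout: category ids between
    '%' lines, followed by word -> category-id mappings)."""
    categories, lexicon = {}, {}
    with open(path, encoding="utf-8") as f:
        sections = f.read().split("%")
    for line in sections[1].strip().splitlines():      # category block
        cat_id, cat_name = line.split()[:2]
        categories[cat_id] = cat_name
    for line in sections[2].strip().splitlines():      # word block
        parts = line.split()
        lexicon[parts[0]] = parts[1:]                  # word -> category ids
    return categories, lexicon

def liwc_counts(text: str, categories: dict, lexicon: dict) -> list:
    """Return one normalized count per category for a single text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for word, cats in lexicon.items():
            # a trailing '*' in a dictionary entry acts as a prefix wildcard
            if (word.endswith("*") and tok.startswith(word[:-1])) or tok == word:
                counts.update(cats)
    total = max(len(tokens), 1)
    return [counts[cat_id] / total for cat_id in categories]
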
Model Details

  • Embeddings: DistilBERT (CLS token from distilbert-base-cased-distilled-squad)
  • Linguistic Features: Word count vectors from a custom LIWC dictionary
  • Classifier: One RandomForestClassifier per trait, tuned with custom hyperparameters
  • Scaling: Features are scaled using StandardScaler
  • Labels: Traits are categorized into low, medium, or high
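
The CLS embedding described above can be extracted with the transformers library roughly as follows. This is a minimal sketch; the repository's embedding_from_text.py may differ in details such as truncation length.

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "distilbert/distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def cls_embedding(text: str) -> torch.Tensor:
    """Return the 768-dim hidden state of the [CLS] token for one text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (batch, seq_len, 768); position 0 is [CLS]
    return outputs.last_hidden_state[0, 0]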

Training & Evaluation

  • Each trait classifier was trained on a labeled dataset using combined BERT+LIWC features.
  • Validation was performed on a separate set simulating job interview answers.
  • Random Forest hyperparameters (e.g., n_estimators, max_depth) were manually optimized per trait for best F1-score.
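
The tuning above is described as manual; a grid search such as the following is one way to reproduce that kind of per-trait selection. The grid values and the toy data are purely illustrative and are not the settings used for the released models.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the real scaled BERT+LIWC feature matrix and low/medium/high labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 832))          # 768 embedding dims + 64 LIWC features
y_train = rng.choice(["low", "medium", "high"], size=120)

# Illustrative grid only
param_grid = {"n_estimators": [100, 200, 400], "max_depth": [10, 20, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",   # macro F1 discourages ignoring minority labels
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)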

For more information about the evaluation, visit our project's GitHub page.

Evaluation Results of the Final Optimized Model (Validation Set)

Trait                  Accuracy    Macro F1-score
Openness                 0.62          0.47
Conscientiousness        0.62          0.48
Extraversion             0.47          0.44
Agreeableness            0.38          0.36
Emotional stability      0.53          0.47

Notes

  • The model does not use Hugging Face's pipeline() interface because it integrates custom feature engineering steps.
  • You can import PersonalityClassifier directly to use the model.

Requirements

Dependencies include:

  • numpy
  • pandas
  • scikit-learn
  • torch
  • transformers
  • joblib
  • tqdm

Author

University of Antwerp – "Digital Text Analysis", AMIV-NLP-2025


License

This project is licensed under the MIT License.
