Personality Trait Predictor (Big Five)

This repository provides a machine learning pipeline for predicting the Big Five personality traits (OCEAN) from text input. It combines DistilBERT embeddings, LIWC-style linguistic features, and a set of Random Forest classifiers (one for each trait) trained on annotated personality data.

Predicted traits:

  • Openness
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • Emotional Stability

Each trait is predicted as a categorical label: low, medium, or high.


How It Works

  • Training was done on the PANDORA dataset.
  • Embeddings are extracted using the pretrained model distilbert/distilbert-base-cased-distilled-squad.
  • 64 LIWC-style features are extracted using a word-to-category mapping dictionary (output.dic).
  • Both feature sets are concatenated and passed to a trait-specific Random Forest classifier.
  • Predictions are returned as string labels for all five traits.
  • There are five separate Random Forests, one for each personality trait, each tuned separately with its own hyperparameters to keep the prediction model fair and unbiased (a rough sketch follows this list).
  • Hyperparameter optimization was based on accuracy and F1-score, with the requirement that all three labels are actually predicted (i.e., no class is ignored).
  • Each Random Forest classifier therefore uses its own n_estimators and max_depth to reach its optimum.
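
A rough sketch of this per-trait prediction loop is shown below. It is a simplified illustration based on the files listed under models/, not the exact code in personality_model.py; the feature ordering and loading logic may differ in the real implementation.

import joblib
import numpy as np

# Trait names assumed to match the classifier filenames under models/
TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "emotional_stability"]

def predict_traits(embedding: np.ndarray, liwc_features: np.ndarray) -> dict:
    """Concatenate a 768-dim DistilBERT CLS embedding with the 64 LIWC-style
    features, scale them, and run each trait-specific Random Forest."""
    features = np.concatenate([embedding, liwc_features]).reshape(1, -1)

    scaler = joblib.load("models/feature_scaler.pkl")
    features = scaler.transform(features)

    return {
        trait: joblib.load(f"models/{trait}_classifier.pkl").predict(features)[0]
        for trait in TRAITS  # each prediction is 'low', 'medium' or 'high'
    }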

Example Usage

from personality_model import PersonalityClassifier

model = PersonalityClassifier()

text = "I enjoy solving challenging problems and thinking about philosophical questions."
predictions = model.predict_all_traits(text)

print(predictions)
# Output:
# {
#   'Openness': 'high',
#   'Conscientiousness': 'medium',
#   'Extraversion': 'low',
#   'Agreeableness': 'medium',
#   'Emotional stability': 'low'
# }

How to Use

After cloning the project, you can run two files to get the predictions:

  • For a demo, run test_personality_model.py. A text passage provided inside the script will be used to generate predictions.

  • For structured data containing text, use predict_from_csv.ipynb. Change the input and output paths so they point to your test data. The notebook extracts the texts under the Q1, Q2 and Q3 columns, concatenates them, and passes the full text to PersonalityClassifier(). It then writes (or overwrites) the model predictions for the five traits, while all other columns of the CSV file are left untouched. The output is the same CSV file with labels for the OCEAN traits (see the sketch below).
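
The notebook's logic is roughly equivalent to the sketch below. The file paths are placeholders, and the output column names are assumed to follow the keys returned by predict_all_traits().

import pandas as pd
from personality_model import PersonalityClassifier

model = PersonalityClassifier()

# Placeholder paths -- point these at your own input/output files
df = pd.read_csv("data/test_data.csv")

for idx, row in df.iterrows():
    # Concatenate the three answer columns into one text passage
    text = " ".join(str(row[col]) for col in ("Q1", "Q2", "Q3"))
    predictions = model.predict_all_traits(text)
    # Write (or overwrite) one column per trait; all other columns stay untouched
    for trait, label in predictions.items():
        df.loc[idx, trait] = label

df.to_csv("data/test_data_with_predictions.csv", index=False)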

Installation

We recommend creating a conda environment. Clone the repository and install dependencies:

git clone https://huggingface.co/Arash-Alborz/personality-trait-predictor
cd personality-trait-predictor

# Create a conda environment
conda create -n personality_env python=3.9
conda activate personality_env

# Install dependencies
pip install -r requirements.txt

Project Structure

personality-trait-predictor/
├── personality_model.py              # Main class for prediction
├── requirements.txt
├── README.md
├── .gitignore
├── .gitattributes
├── models/
│   ├── feature_scaler.pkl                      # StandardScaler for feature scaling
│   ├── output.dic                              # LIWC-style dictionary
│   ├── openness_classifier.pkl                 # Classifiers ...
│   ├── conscientiousness_classifier.pkl
│   ├── extraversion_classifier.pkl
│   ├── agreeableness_classifier.pkl
│   └── emotional_stability_classifier.pkl
└── feature_extraction/
    ├── __init__.py
    ├── embedding_from_text.py        # Embedding extraction with DistilBERT
    └── liwc_from_text.py             # LIWC feature extraction

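The LIWC-style features come from feature_extraction/liwc_from_text.py together with models/output.dic. The sketch below shows what this kind of dictionary-based counting typically looks like; it assumes a standard LIWC .dic layout (category ids between '%' lines, then word-to-category mappings with '*' as a prefix wildcard), which may differ from the actual format of output.dic.

import re
from collections import Counter

def load_liwc_dic(path: str):
    """Parse a LIWC-style .dic file (assumed layout: category ids between
    '%' lines, followed by word -> category-id mappings)."""
    categories, lexicon = {}, {}
    with open(path, encoding="utf-8") as f:
        sections = f.read().split("%")
    for line in sections[1].strip().splitlines():      # category block
        cat_id, cat_name = line.split()[:2]
        categories[cat_id] = cat_name
    for line in sections[2].strip().splitlines():      # word block
        parts = line.split()
        lexicon[parts[0]] = parts[1:]                  # word -> category ids
    return categories, lexicon

def liwc_counts(text: str, categories: dict, lexicon: dict) -> list:
    """Return one normalized count per category for a single text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for word, cats in lexicon.items():
            # a trailing '*' in a dictionary entry acts as a prefix wildcard
            if (word.endswith("*") and tok.startswith(word[:-1])) or tok == word:
                counts.update(cats)
    total = max(len(tokens), 1)
    return [counts[cat_id] / total for cat_id in categories]
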
Model Details

  • Embeddings: DistilBERT (CLS token from distilbert-base-cased-distilled-squad)
  • Linguistic Features: Word count vectors from a custom LIWC dictionary
  • Classifier: One RandomForestClassifier per trait, tuned with custom hyperparameters
  • Scaling: Features are scaled using StandardScaler
  • Labels: Traits are categorized into low, medium, or high
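
The CLS embedding described above can be extracted with the transformers library roughly as follows. This is a minimal sketch; the repository's embedding_from_text.py may differ in details such as truncation length.

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "distilbert/distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def cls_embedding(text: str) -> torch.Tensor:
    """Return the 768-dim hidden state of the [CLS] token for one text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (batch, seq_len, 768); position 0 is [CLS]
    return outputs.last_hidden_state[0, 0]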

Training & Evaluation

  • Each trait classifier was trained on a labeled dataset using combined BERT+LIWC features.
  • Validation was performed on a separate set simulating job interview answers.
  • Random Forest hyperparameters (e.g., n_estimators, max_depth) were manually optimized per trait for best F1-score.
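
The tuning above is described as manual; a grid search such as the following is one way to reproduce that kind of per-trait selection. The grid values and the toy data are purely illustrative and are not the settings used for the released models.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the real scaled BERT+LIWC feature matrix and low/medium/high labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 832))          # 768 embedding dims + 64 LIWC features
y_train = rng.choice(["low", "medium", "high"], size=120)

# Illustrative grid only
param_grid = {"n_estimators": [100, 200, 400], "max_depth": [10, 20, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",   # macro F1 discourages ignoring minority labels
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)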

For more information about the evaluation, visit our project's GitHub page.

Evaluation Results of the Final Optimized Model (Validation Set)

Trait                  Accuracy    Macro F1-score
Openness                 0.62          0.47
Conscientiousness        0.62          0.48
Extraversion             0.47          0.44
Agreeableness            0.38          0.36
Emotional stability      0.53          0.47

Notes

  • The model does not use Hugging Face's pipeline() interface because it integrates custom feature engineering steps.
  • You can import PersonalityClassifier directly to use the model.

Requirements

Dependencies include:

  • numpy
  • pandas
  • scikit-learn
  • torch
  • transformers
  • joblib
  • tqdm

Author

University of Antwerp – "Digital Text Analysis", AMIV-NLP-2025


License

This project is licensed under the MIT License.
