GDPR Violation Result Classifier

This repository contains a fine-tuned transformer model that predicts the outcome of a GDPR case (violation or no violation) from its case features.

Model Overview

  • Base Model: JQ1984/legalbert_gdpr_pretrained (a BERT model pre-trained on legal and GDPR-specific texts)
  • Task: Binary classification to predict violation results (0: no violation, 1: violation)
  • Training Method: 5-fold cross-validation with hyperparameter optimization

Dataset

The model was trained on a custom dataset of real GDPR cases. The dataset includes:

  • 2,412 cases total (2,058 violations, 354 non-violations, so about 85% of cases are violations)
  • Features include affected data volume, countries, industry sectors, data categories, data processing basis, GDPR clauses, and various violation indicators
  • All categorical features were converted to text descriptions for the transformer model
  • Dataset link: https://huggingface.co/datasets/JQ1984/GDPRcasedata
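The text conversion mentioned above can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: the helper name `row_to_text` and the "<name> is <value>." template are assumptions inferred from the example input in the Usage section, and the exact wording used during training is not documented here.

```python
def row_to_text(row):
    """Render a dict of case features as '<name> is <value>.' sentences."""
    parts = []
    for name, value in row.items():
        if value is None:
            # Assumption: missing features are skipped; the source does not
            # say how gaps were handled.
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        parts.append(f"{name} is {value}.")
    return " ".join(parts)


# Hypothetical feature row using column names from the Usage example.
example = {
    "country": "Germany",
    "company_industry": "Technology",
    "data_category_personal_data": True,
}
print(row_to_text(example))
```

Keeping the template identical between training and inference matters, because the model only ever saw features phrased this way.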

Training Methodology

The training pipeline followed these steps:

  1. Text Conversion: All numerical and categorical features were converted to text descriptions
  2. K-Fold Cross-Validation: 5-fold cross-validation was used to estimate how well the model generalizes to unseen cases
  3. Fine-tuning: The pre-trained LegalBERT base model was fine-tuned on the binary classification task
  4. Hyperparameters:
    • Batch size: 16
    • Learning rate: 3e-5
    • Epochs: 3
    • Weight decay: 0.01
    • Optimizer: AdamW
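To make the cross-validation setup concrete, the sketch below splits the 2,412 cases into 5 folds and derives the number of optimizer steps per fold from the batch size and epoch count listed above. It uses only the standard library. Two things are assumptions rather than documented facts: that the folds were stratified by class (the description only says "5-fold"), and that training ran on one device without gradient accumulation. The function `stratified_folds` is a hypothetical helper.

```python
import math
import random

# Values taken from the Dataset and Hyperparameters sections.
N_VIOLATION, N_NON_VIOLATION = 2058, 354
N_FOLDS, BATCH_SIZE, EPOCHS = 5, 16, 3


def stratified_folds(labels, n_folds, seed=0):
    """Assign each example index to a fold, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)  # round-robin within the class
    return folds


labels = [1] * N_VIOLATION + [0] * N_NON_VIOLATION
folds = stratified_folds(labels, N_FOLDS)

for k, val_idx in enumerate(folds):
    n_val = len(val_idx)
    n_train = len(labels) - n_val
    # Assumes one device and no gradient accumulation.
    steps = math.ceil(n_train / BATCH_SIZE) * EPOCHS
    share = sum(labels[i] for i in val_idx) / n_val
    print(f"fold {k}: train={n_train} val={n_val} "
          f"violation share={share:.3f} optimizer steps={steps}")
```

Stratifying keeps roughly 85% violations in every validation fold, which is useful with a class split this uneven. A plain shuffled split would also be a valid reading of "5-fold cross-validation".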

Performance Metrics

Averaged over the 5 cross-validation folds, the model achieved:

  • Average Accuracy: 95.03%
  • Average F1 Score: 89.33%
  • Average Precision: 92.79%
  • Average Recall: 86.60%
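The averages above do not say how they were aggregated (for example, which class counts as positive, or whether F1 was computed per fold and then averaged). The sketch below shows one possible reading, averaging per-fold metrics, and why the mean F1 need not equal the F1 computed from the mean precision and recall. The per-fold counts are toy values invented for illustration and are not the model's results. The helper names `prf` and `mean` are hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one fold from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Toy counts, invented purely for illustration (NOT the model's results).
toy_folds = [dict(tp=8, fp=2, fn=0), dict(tp=5, fp=0, fn=5)]
per_fold = [prf(**counts) for counts in toy_folds]
avg_p, avg_r, avg_f1 = (mean(col) for col in zip(*per_fold))
f1_of_avgs = 2 * avg_p * avg_r / (avg_p + avg_r)
print(f"toy: mean per-fold F1={avg_f1:.4f}, F1 of mean P/R={f1_of_avgs:.4f}")

# The same check applied to the averages reported above.
reported_p, reported_r, reported_f1 = 0.9279, 0.8660, 0.8933
implied_f1 = 2 * reported_p * reported_r / (reported_p + reported_r)
print(f"reported F1={reported_f1:.4f}, "
      f"F1 implied by mean P/R={implied_f1:.4f}")
```

The reported F1 is slightly below the value implied by the mean precision and recall, which is consistent with per-fold averaging but does not prove it. Other aggregation schemes, such as macro-averaging over classes, could produce a similar gap.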

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer (replace the placeholder with the actual repository ID)
model_path = "YOUR_USERNAME/gdpr-violation-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()  # make sure dropout is disabled for inference

# Example text (format similar to training data)
text = "GDPR clauses are Art. 5, Art. 6. Date is 2022-05-15. country is Germany. company_industry is Technology. data_category_personal_data is true. data_processing_basis_consent is true."

# Tokenize and predict without tracking gradients
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
probabilities = outputs.logits.softmax(dim=-1)
predicted_class = probabilities.argmax(dim=-1).item()

# Class 0: no violation, class 1: violation
print(f"Predicted class: {predicted_class}")
print(f"Class probabilities: {probabilities[0].tolist()}")
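The raw class index can be turned into a readable label with a small helper. The sketch below is standard-library only and works on a plain list of two logits. The label mapping comes from the Task description (0: no violation, 1: violation). The function name `interpret`, the `threshold` parameter, and the sample logits are assumptions made for illustration.

```python
import math

LABELS = {0: "no violation", 1: "violation"}  # mapping from the Task description


def interpret(logits, threshold=0.5):
    """Turn a pair of raw logits into a label and the violation probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    p_violation = exps[1] / sum(exps)
    predicted = 1 if p_violation >= threshold else 0
    return LABELS[predicted], p_violation


# Made-up logits, only to show the call shape.
label, p = interpret([-1.2, 2.3])
print(label, round(p, 3))
```

With the Usage snippet, the call would look like `interpret(outputs.logits[0].tolist())`. Because about 85% of the training cases are violations, raising `threshold` above 0.5 is one way to trade recall for precision on the violation class. Whether that helps is something to check on held-out data.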

Contact

For questions, feedback, or collaboration opportunities, please contact Jacques Qiu (邱耿航):

  • Email: [email protected]
  • GitHub: JacquotQ
  • LinkedIn: https://www.linkedin.com/in/jacques-qiu-50477b266/
