GDPR Violation Result Classifier
This repository contains a fine-tuned transformer model for predicting GDPR violation results based on case features.
Model Overview
- Base Model: JQ1984/legalbert_gdpr_pretrained (a BERT model pre-trained on legal and GDPR-specific texts)
- Task: Binary classification to predict violation results (0: no violation, 1: violation)
- Training Method: 5-fold cross-validation with hyperparameter optimization
Dataset
The model was trained on a custom GDPR violation dataset containing real violation cases. The dataset includes:
- 2,412 cases total (2,058 violations, 354 non-violations)
- Features include affected data volume, countries, industry sectors, data categories, data processing basis, GDPR clauses, and various violation indicators
- All categorical features were converted to text descriptions for the transformer model
- Dataset link: https://huggingface.co/datasets/JQ1984/GDPRcasedata
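The text conversion described above can be sketched as follows. This is a minimal illustration, not the preprocessing script used for training: the function name `features_to_text` and the sample record are hypothetical, and the "name is value." sentence pattern is inferred from the example input in the Usage section.

```python
def features_to_text(case):
    """Render a case's features as a sequence of 'name is value.' sentences."""
    parts = []
    for name, value in case.items():
        if isinstance(value, bool):
            # Booleans are written in lowercase, as in the Usage example.
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Multi-valued features are joined with commas.
            value = ", ".join(str(v) for v in value)
        parts.append(f"{name} is {value}.")
    return " ".join(parts)


# Hypothetical record; the field names mirror those in the Usage example.
case = {
    "country": "Germany",
    "company_industry": "Technology",
    "data_category_personal_data": True,
    "data_processing_basis_consent": True,
}
print(features_to_text(case))
```

Writing every feature as a short sentence lets the tokenizer treat the whole case as ordinary text, so no separate numeric or one-hot input pathway is needed.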
Training Methodology
The training pipeline followed these steps:
- Text Conversion: All numerical and categorical features were converted to text descriptions
- K-Fold Cross-Validation: 5-fold cross-validation was used to ensure robust model performance
- Fine-tuning: The GDPR-pretrained LegalBERT model (JQ1984/legalbert_gdpr_pretrained) was fine-tuned on the binary classification task
- Hyperparameters:
  - Batch size: 16
  - Learning rate: 3e-5
  - Epochs: 3
  - Weight decay: 0.01
  - Optimizer: AdamW
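The cross-validation step can be sketched with the standard library alone. This sketch makes several assumptions that the card does not confirm: the split is stratified by class (a sensible choice given the 2,058 to 354 imbalance), the seed is arbitrary, and the helper name `stratified_kfold_indices` and the dictionary keys are illustrative. In practice a library such as scikit-learn's `StratifiedKFold` is the usual choice.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets a fair share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx


# Class counts from the Dataset section: 2,058 violations and 354 non-violations.
labels = [1] * 2058 + [0] * 354

# Hyperparameters as listed above; the key names are illustrative only.
hyperparameters = {
    "batch_size": 16,
    "learning_rate": 3e-5,
    "epochs": 3,
    "weight_decay": 0.01,
    "optimizer": "AdamW",
}

for fold, (train_idx, val_idx) in enumerate(stratified_kfold_indices(labels), start=1):
    violations = sum(labels[i] for i in val_idx)
    print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)} violations_in_val={violations}")
```

Each fold would then fine-tune a fresh copy of the base model on `train_idx` with the hyperparameters above and evaluate on `val_idx`.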
Performance Metrics
The model achieved the following performance metrics across 5-fold cross-validation:
- Average Accuracy: 95.03%
- Average F1 Score: 89.33%
- Average Precision: 92.79%
- Average Recall: 86.60%
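When reading these averages, note that the harmonic mean of the reported average precision and recall is about 89.59%, slightly above the reported 89.33% F1. One possible explanation, an assumption rather than something stated here, is that F1 was computed per fold and then averaged. The sketch below shows how the two quantities can differ; the per-fold counts are hypothetical and are not the model's actual results.

```python
from statistics import mean


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts for one fold."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Reported cross-validation averages from the list above.
reported_precision, reported_recall, reported_f1 = 0.9279, 0.8660, 0.8933
f1_from_reported_averages = (
    2 * reported_precision * reported_recall / (reported_precision + reported_recall)
)
print(f"F1 of averaged precision/recall: {f1_from_reported_averages:.4f}")
print(f"Reported average F1:             {reported_f1:.4f}")

# Hypothetical per-fold counts (true positives, false positives, false negatives).
fold_counts = [(60, 4, 11), (63, 6, 8), (58, 3, 13), (62, 5, 9), (61, 6, 9)]
per_fold = [precision_recall_f1(*counts) for counts in fold_counts]

avg_precision = mean(m[0] for m in per_fold)
avg_recall = mean(m[1] for m in per_fold)
avg_f1 = mean(m[2] for m in per_fold)  # F1 computed per fold, then averaged
f1_of_averages = 2 * avg_precision * avg_recall / (avg_precision + avg_recall)

print(f"Mean of per-fold F1: {avg_f1:.4f}")
print(f"F1 of mean P and R:  {f1_of_averages:.4f}")
```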
Usage
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer (replace the placeholder with the actual repository ID)
model_path = "YOUR_USERNAME/gdpr-violation-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Example text (format similar to training data)
text = "GDPR clauses are Art. 5, Art. 6. Date is 2022-05-15. country is Germany. company_industry is Technology. data_category_personal_data is true. data_processing_basis_consent is true."

# Tokenize and predict; gradients are not needed for inference
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax turns the two logits into class probabilities (0: no violation, 1: violation)
probabilities = outputs.logits.softmax(dim=-1)
predicted_class = outputs.logits.argmax(dim=-1).item()
print(f"Predicted class: {predicted_class}")
print(f"Class probabilities: {probabilities[0].tolist()}")
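To report a readable label instead of a class index, the logits can be mapped through the 0/1 convention given in the Model Overview. The helper below is a dependency-free sketch; `interpret_logits` and `ID2LABEL` are hypothetical names, and whether the published checkpoint's config already carries label names is not stated here.

```python
import math

# Label convention from the Model Overview section.
ID2LABEL = {0: "no violation", 1: "violation"}


def interpret_logits(logits):
    """Convert raw logits for one example into (label, softmax probability)."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[predicted], probs[predicted]


# Illustrative logits, not real model output.
label, confidence = interpret_logits([-1.2, 2.3])
print(f"{label} ({confidence:.1%})")
```

With the usage code above, the same helper could be called as `interpret_logits(outputs.logits[0].tolist())`.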
Contact
For questions, feedback, or collaboration opportunities, please contact Jacques Qiu (邱耿航):
- Email: [email protected]
- GitHub: JacquotQ
- LinkedIn: https://www.linkedin.com/in/jacques-qiu-50477b266/
Model tree for JQ1984/violation_result_GDPR_prediction
- Base model: nlpaueb/legal-bert-base-uncased
- Finetuned: JQ1984/legalbert_gdpr_pretrained