PARENT BERT Models for Privacy Policy Analysis
This repository contains TorchScript versions of 15 fine-tuned BERT models used in the PARENT project to analyse mobile app privacy policies. These models identify what data is collected, why it is collected, and how it is processed, helping assess GDPR compliance.
They are part of a hybrid framework designed for non-technical users, particularly parents concerned about children’s privacy.
Model Purpose
- Segment privacy policies to detect:
  - Data collection types (e.g., contact info, location)
  - Purpose of data collection
  - How data is processed
- Support GDPR compliance evaluation
- Detect potential third-party sharing (in combination with a logistic regression model)
References
- MAPP Dataset: Arora, S., Hosseini, H., Utz, C., Bannihatti Kumar, V., Dhellemmes, T., Ravichander, A., Story, P., Mangat, J., Chen, R., Degeling, M., Norton, T.B., Hupperich, T., Wilson, S., & Sadeh, N.M. (2022). A tale of two regulatory regimes: Creation and analysis of a bilingual privacy policy corpus. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2022). PDF link [Accessed 12 July 2025].
Usage
import torch
from transformers import BertTokenizerFast
from huggingface_hub import hf_hub_download
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
REPO_ID = "Bnaad/PARENT_bert"
# Load tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Load one TorchScript model from Hugging Face
label_name = "Information Type_Contact information"
safe_label = label_name.replace(" ", "_").replace("/", "_")
filename = f"torchscript_{safe_label}.pt"
model_path = hf_hub_download(repo_id=REPO_ID, filename=filename)
model = torch.jit.load(model_path, map_location=device)
model.to(device)
model.eval()
# Example inference
sample_text = """For any questions about your account or our services, please contact our customer support team by emailing [email protected], calling +1-800-555-1234, or visiting our office at 123 Main Street, Springfield, IL, 62701 during business hours"""
inputs = tokenizer(
    sample_text,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
).to(device)
# Run the forward pass without tracking gradients
with torch.no_grad():
    outputs = model(inputs["input_ids"], inputs["attention_mask"])
print("Logits:", outputs)
# Apply a sigmoid to turn the logit(s) into probabilities
prob = torch.sigmoid(outputs.squeeze())
print(prob)
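The snippet above loads a single model and prints a raw probability. As a minimal sketch of the two steps around it, the helpers below rebuild the file name for a given label using the same replacement rule as the usage code, and turn a logit into a yes/no decision. The function names `torchscript_filename` and `logit_to_decision` and the 0.5 threshold are assumptions made for illustration; the repository does not state a recommended threshold, and the only label name confirmed above is "Information Type_Contact information".

```python
import math


def torchscript_filename(label_name: str) -> str:
    """Build the .pt file name with the same rule as the usage snippet:
    spaces and slashes in the label become underscores."""
    safe_label = label_name.replace(" ", "_").replace("/", "_")
    return f"torchscript_{safe_label}.pt"


def logit_to_decision(logit: float, threshold: float = 0.5):
    """Convert one logit to (probability, flagged).

    The 0.5 threshold is an assumption, not a value documented by the
    repository; tune it on held-out data if precision or recall matters.
    """
    # Numerically stable sigmoid: avoids overflow for large negative logits.
    if logit >= 0:
        prob = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        prob = e / (1.0 + e)
    return prob, prob >= threshold


if __name__ == "__main__":
    # Only this label is named in the usage section; others would follow
    # the same pattern if they exist in the repository.
    print(torchscript_filename("Information Type_Contact information"))

    # A hypothetical logit value, standing in for outputs.squeeze().item().
    prob, flagged = logit_to_decision(2.0)
    print(f"probability={prob:.3f} flagged={flagged}")
```

In practice the logit would come from `outputs.squeeze().item()` in the inference code, and the same pair of helpers could be applied in a loop over several labels to collect one decision per model.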