---
language: en
license: apache-2.0
library_name: transformers
tags:
- bert
- text-classification
- privacy-policy
- gdpr
- torchscript
datasets:
- MAPP-116
metrics:
- f1
model-index:
- name: PARENT BERT
  results:
  - task:
      type: text-classification
    dataset:
      name: MAPP-116
      type: text
    metrics:
    - name: f1
      type: score
      value: 0.80
---

# PARENT BERT Models for Privacy Policy Analysis

This repository contains **TorchScript versions of 15 fine-tuned BERT models** used in the PARENT project to analyse mobile app privacy policies. These models identify **what data is collected, why it is collected, and how it is processed**, which helps assess GDPR compliance.

They are part of a hybrid framework designed for non-technical users, particularly parents concerned about children’s privacy.


---

## Model Purpose

- Segment privacy policies to detect:
  - Data collection types (e.g., contact info, location)
  - Purpose of data collection
  - How data is processed
- Support GDPR compliance evaluation
- Detect potential third-party sharing (in combination with a logistic regression model)

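The workflow in the list above can be sketched as: split a policy into segments, score each segment with one binary classifier per label, and keep the labels whose score clears a threshold. This is a minimal sketch, not the project's actual pipeline. Splitting on blank lines, the 0.5 threshold, the function names `split_into_segments` and `label_segments`, and the keyword-based stand-in scorer are all assumptions made for illustration; in practice each scorer would wrap one of the TorchScript models.

```python
import re
from typing import Callable, Dict, List


def split_into_segments(policy_text: str) -> List[str]:
    """Split a policy into paragraph-like segments on blank lines (assumed rule)."""
    parts = re.split(r"\n\s*\n", policy_text)
    return [p.strip() for p in parts if p.strip()]


def label_segments(
    segments: List[str],
    scorers: Dict[str, Callable[[str], float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the project
) -> List[List[str]]:
    """For each segment, return the labels whose score reaches the threshold."""
    return [
        [label for label, score in scorers.items() if score(seg) >= threshold]
        for seg in segments
    ]


# Stand-in scorer for illustration only; a real one would run a TorchScript
# model on the segment and return the sigmoid of its logit.
def toy_contact_scorer(segment: str) -> float:
    return 0.9 if "email" in segment.lower() else 0.1


policy = "We collect your email address.\n\nWe use cookies for analytics."
segments = split_into_segments(policy)
labels = label_segments(
    segments, {"Information Type_Contact information": toy_contact_scorer}
)
print(labels)
```

Keeping the scorers in a dictionary keyed by label name means more of the 15 models can be added without changing the loop.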

---

## References

- **MAPP Dataset:** Arora, S., Hosseini, H., Utz, C., Bannihatti Kumar, V., Dhellemmes, T., Ravichander, A., Story, P., Mangat, J., Chen, R., Degeling, M., Norton, T.B., Hupperich, T., Wilson, S., & Sadeh, N.M. (2022). *A tale of two regulatory regimes: Creation and analysis of a bilingual privacy policy corpus*. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2022). [PDF link](https://aclanthology.org/2022.lrec-1.585.pdf) [Accessed 12 July 2025].

---

## Usage

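When loading more than one of the 15 models, the label-to-filename rule used in this section's loading code can be factored into a small helper. The sketch below assumes every file follows the same `torchscript_<label>.pt` pattern; only the contact-information filename is confirmed by the example in this section. The helper name `label_to_filename` and the second label are hypothetical.

```python
def label_to_filename(label_name: str) -> str:
    """Map a label name to the TorchScript filename pattern used in this section."""
    # Spaces and slashes are replaced so the label is safe to use as a filename.
    safe_label = label_name.replace(" ", "_").replace("/", "_")
    return f"torchscript_{safe_label}.pt"


print(label_to_filename("Information Type_Contact information"))
# Hypothetical label containing a slash, only to show the replacement rule
print(label_to_filename("Example Category_Foo/Bar"))
```

The returned string can be passed as the `filename` argument of `hf_hub_download`.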
```python
import torch
from transformers import BertTokenizerFast
from huggingface_hub import hf_hub_download

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
REPO_ID = "Bnaad/PARENT_bert"

# Load tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Load one TorchScript model from Hugging Face.
# The filename is derived from the label: spaces and slashes become underscores.
label_name = "Information Type_Contact information"
safe_label = label_name.replace(" ", "_").replace("/", "_")
filename = f"torchscript_{safe_label}.pt"
model_path = hf_hub_download(repo_id=REPO_ID, filename=filename)
model = torch.jit.load(model_path, map_location=device)
model.to(device)
model.eval()

# Example inference
sample_text = """For any questions about your account or our services, please contact our customer support team by emailing [email protected], calling +1-800-555-1234, or visiting our office at 123 Main Street, Springfield, IL, 62701 during business hours"""
inputs = tokenizer(
    sample_text,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=512,
).to(device)

# Disable gradient tracking for inference
with torch.no_grad():
    outputs = model(inputs["input_ids"], inputs["attention_mask"])

print("Logits:", outputs)
# Sigmoid maps the raw logit to a probability for this label
prob = torch.sigmoid(outputs.squeeze())
print(prob)
```