
PreDA-large (Prefix-Based Dream Reports Annotation)

This model is a fine-tuned version of google-t5/t5-large on the annotated Dreambank.net dataset. Per-epoch results on the evaluation set are reported under Training results.

Intended uses & limitations

This model is designed for research purposes. See the disclaimer for more details.

Training procedure

The overall idea of our approach is to separate each dream report from its full annotation and to create an augmented set of (dream report, single-feature annotation) pairs. To make sure that, given the same report, the model produces the annotation for a specific Hall and Van de Castle (HVDC) feature, we prepend to each report a string of the form "HVDC-Feature:", closely mimicking T5 task-specific prefix fine-tuning.
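The augmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `expand_report`, the dictionary layout of the annotations, and the placeholder target strings are assumptions, and the "Feature : report" prefix format is taken from the usage example in this card.

```python
from typing import Dict, List, Tuple


def expand_report(report: str, annotations: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn one annotated report into one (input, target) pair per feature."""
    pairs = []
    for feature, target in annotations.items():
        # Prepend the feature name as a task prefix, in the style of T5 prefixes.
        source = "{} : {}".format(feature, report)
        pairs.append((source, target))
    return pairs


report = "I was talking with my brother about my birthday dinner. I was feeling sad."
# Placeholder targets (assumption): real annotation strings come from the dataset.
annotations = {
    "Emotion": "<emotion annotation>",
    "Characters": "<characters annotation>",
}

for source, target in expand_report(report, annotations):
    print(source, "->", target)
```

Each report therefore contributes as many training items as it has annotated features, which is how a dataset of reports grows into a larger set of (input, target) pairs.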

Applying this procedure to the original dataset (~1.8K) yields approximately 6.6K items. In the present study, we focused on a subset of six HVDC features: Characters, Activities, Emotion, Friendliness, Misfortune, and Good Fortune. This selection excludes features that represent less than 10% of the total instances. Notably, Good Fortune would have been excluded under this criterion, but we intentionally retained it to control against potential memorisation effects and to provide a counterbalance to the Misfortune feature. After filtering out instances whose annotation feature is not one of the six selected features, we are left with ~5.3K dream reports. We then generate a random 80%-20% split into training (4,311 reports) and testing (1,078 reports) sets.
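The filtering and splitting steps can be illustrated with a short sketch. It is written under assumptions: the function name `filter_and_split`, the `(feature, payload)` item layout, the shuffling seed, and the synthetic stand-in data (including the hypothetical feature name "Other") are not from the original work. The item count of 5,389 is simply the sum of the reported training and testing sizes.

```python
import random

SELECTED = {
    "Characters", "Activities", "Emotion",
    "Friendliness", "Misfortune", "Good Fortune",
}


def filter_and_split(items, train_fraction=0.8, seed=42):
    """Keep items whose feature is selected, then shuffle and split them."""
    kept = [item for item in items if item[0] in SELECTED]
    rng = random.Random(seed)  # seed choice is an assumption
    rng.shuffle(kept)
    n_train = int(len(kept) * train_fraction)
    return kept[:n_train], kept[n_train:]


# Synthetic stand-in data: 5,389 items with a selected feature,
# plus 100 items with a feature that gets filtered out.
items = [("Emotion", i) for i in range(5389)] + [("Other", i) for i in range(100)]
train, test = filter_and_split(items)
print(len(train), len(test))  # 4311 1078
```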

Training

Hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.001
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 20
  • label_smoothing_factor: 0.1
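For readers who want to reuse these settings, the sketch below collects them as keyword arguments using the parameter names of `Seq2SeqTrainingArguments` in Transformers. It is an illustration, not the authors' training script: the `output_dir` value is hypothetical, and the step arithmetic assumes a single device with no gradient accumulation.

```python
import math

training_kwargs = {
    "output_dir": "preda-large-finetune",  # hypothetical path
    "learning_rate": 1e-3,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 20,
    "label_smoothing_factor": 0.1,
}

# With transformers installed, these could be passed as:
#   from transformers import Seq2SeqTrainingArguments
#   args = Seq2SeqTrainingArguments(**training_kwargs)

# Derived quantities (assumes one device, no gradient accumulation).
train_size = 4311  # training set size reported in this card
steps_per_epoch = math.ceil(train_size / training_kwargs["per_device_train_batch_size"])
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]
print(steps_per_epoch, total_steps)  # 539 10780
```

Under these assumptions the derived values agree with the Step column of the training results, which advances by 539 per epoch and ends at 10780.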

Training results

Training Loss  Epoch  Step   Validation Loss  Rouge1  Rouge2  RougeL  RougeLsum
1.9478         1.0    539    1.9524           0.3298  0.1797  0.3121  0.3113
1.9141         2.0    1078   1.9039           0.3665  0.1942  0.3495  0.3489
1.9140         3.0    1617   1.8993           0.4076  0.2223  0.3873  0.3870
1.9264         4.0    2156   1.8725           0.3454  0.1843  0.3306  0.3302
1.9018         5.0    2695   1.8669           0.3494  0.1814  0.3345  0.3347
1.8890         6.0    3234   1.8872           0.3387  0.1609  0.3211  0.3208
1.8511         7.0    3773   1.8412           0.4200  0.2403  0.4065  0.4065
1.8756         8.0    4312   1.8191           0.4735  0.2705  0.4467  0.4469
1.8483         9.0    4851   1.7966           0.4915  0.2996  0.4662  0.4665
1.8182         10.0   5390   1.7787           0.5071  0.3169  0.4857  0.4860
1.7715         11.0   5929   1.7709           0.5017  0.3182  0.4767  0.4767
1.7955         12.0   6468   1.7557           0.4772  0.3015  0.4544  0.4549
1.7391         13.0   7007   1.7279           0.5644  0.3693  0.5270  0.5281
1.7013         14.0   7546   1.7054           0.5484  0.3694  0.5222  0.5221
1.7364         15.0   8085   1.6900           0.5607  0.3778  0.5349  0.5350
1.6592         16.0   8624   1.6643           0.6010  0.4191  0.5691  0.5688
1.6450         17.0   9163   1.6448           0.6160  0.4440  0.5854  0.5863
1.6245         18.0   9702   1.6264           0.6301  0.4640  0.6015  0.6018
1.6160         19.0   10241  1.6145           0.6578  0.4933  0.6253  0.6251
1.5914         20.0   10780  1.6073           0.6587  0.4979  0.6269  0.6270
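To make the Rouge1 column easier to interpret, here is a simplified sketch of a ROUGE-1 F1 score, computed as the harmonic mean of unigram precision and recall with clipped counts. It is an assumption-laden illustration: it uses lowercase whitespace tokenization and no stemming, so it is not the implementation used to produce the table and will not reproduce its numbers exactly. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("a b", "a b"))                 # 1.0
print(round(rouge1_f1("a b c d", "a b"), 4))   # 0.6667
```

A score of about 0.66, as in the final epoch, means that generated and reference annotations share roughly two thirds of their unigrams on this balanced measure.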

Framework versions

  • Transformers 4.44.2
  • Pytorch 2.1.0+cu118
  • Datasets 3.0.1
  • Tokenizers 0.19.1

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "jrc-ai/PreDA-large"
device = "cpu"
encoder_max_length = 100
decoder_max_length = 50

# Load the tokenizer and model, and place the model on the same device as the inputs
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)

dream = "I was talking with my brother about my birthday dinner. I was feeling sad."
prefixes = ["Emotion", "Activities", "Characters"]
text_inputs = ["{} : {}".format(p, dream) for p in prefixes]

inputs = tokenizer(
    text_inputs,
    max_length=encoder_max_length,
    truncation=True,
    padding=True,
    return_tensors="pt"
)

output = model.generate(
    **inputs.to(device),
    do_sample=False,
    max_length=decoder_max_length,
)

for decode_dream in output:
    print(tokenizer.decode(decode_dream, skip_special_tokens=True))

Dual-Use Implication

Upon evaluation, we identified no dual-use implications for the present model.

Cite

Please note that the paper referring to this model, titled "PreDA: Prefix-Based Dream Reports Annotation with Generative Language Models", has been accepted for publication at the LOD 2025 conference and will appear in the conference proceedings.
