---
license: mit
language:
- en
base_model:
- facebook/bart-large-cnn
pipeline_tag: summarization
library_name: transformers
tags:
- movies
- books
- quotes
- quote-detection
- extractive
---

# BART-large-quotes

BART \[1\] fine-tuned for extractive summarization on a dataset of movie and book quotes.
Training continued from the BART-large-cnn checkpoint, which was itself fine-tuned on the CNN/DailyMail dataset, whose summaries are more extractive than abstractive.

**Compare**: The smaller model [BART-base-quotes](https://huggingface.co/ChrisBridges/bart-base-quotes) achieves slightly lower ROUGE scores
but favors shorter quotes (roughly 1/4 of the length on average).

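If the checkpoint is published under an id analogous to the linked base model (e.g., `ChrisBridges/bart-large-quotes`, assumed here and not confirmed by this card), it can be used with the standard summarization pipeline. A minimal sketch:

```python
from transformers import pipeline

# Repository id assumed by analogy with ChrisBridges/bart-base-quotes (hypothetical).
quote_extractor = pipeline("summarization", model="ChrisBridges/bart-large-quotes")

# A context passage (a few sentences around a candidate quote), as in training.
context = (
    "Rick looked at her one last time before the plane started its engines. "
    "She asked what would become of them. He told her they would always have Paris, "
    "and that the problems of three little people don't amount to a hill of beans "
    "in this crazy world. Here's looking at you, kid."
)

# max_length mirrors the 128-token output cap used during training.
print(quote_extractor(context, max_length=128)[0]["summary_text"])
```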
## Training Description

### Dataset

The model was trained on 11295 quotes, comprising 6280 movie quotes from the Cornell Movie Quotes dataset \[2\] and 5015 book quotes from the T50 dataset \[3\].
As described in the T50 paper, each movie quote is accompanied by a context of 4 sentences on each side, while 10 sentences per side are used for book quotes.
Training/development/test splits in proportions of 7:1:2 were created with stratified sampling (a minimal split sketch follows the tables below).
The tables below report the sample sizes of each split and the length statistics of the contexts and quotes per source.

| Split | Total | Movie | Book |
| ----- | ----- | ----- | ---- |
| Train | 7906 | 4396 | 3510 |
| Dev | 1130 | 628 | 502 |
| Test | 2259 | 1256 | 1003 |

| Data | min | median | max | mean ± std |
| ------------- | --- | ------ | ---- | --------------- |
| Movie Context | 38 | 148 | 3358 | 167.13 ± 102.57 |
| Movie Quote | 5 | 20 | 592 | 28.22 ± 27.79 |
| T50 Context | 86 | 628 | 6497 | 659.14 ± 310.49 |
| T50 Quote | 6 | 41 | 877 | 61.87 ± 63.89 |
| Total Context | 38 | 246 | 6497 | 385.58 ± 329.26 |
| Total Quote | 5 | 26 | 877 | 43.16 ± 50.21 |

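A minimal sketch of such a 7:1:2 stratified split using scikit-learn; the record fields, toy data, and seed are illustrative, since the original preprocessing script is not published here:

```python
from sklearn.model_selection import train_test_split

# Illustrative records; the real dataset pairs each quote with its context.
quotes = (
    [{"context": f"movie context {i}", "quote": f"movie quote {i}", "source": "movie"} for i in range(20)]
    + [{"context": f"book context {i}", "quote": f"book quote {i}", "source": "book"} for i in range(20)]
)

# 70% train, 30% held out, stratified by source so the movie/book ratio is preserved.
train, held_out = train_test_split(
    quotes, test_size=0.3, stratify=[q["source"] for q in quotes], random_state=42
)

# Split the held-out 30% into dev (1 part) and test (2 parts): 10% and 20% overall.
dev, test = train_test_split(
    held_out, test_size=2 / 3, stratify=[q["source"] for q in held_out], random_state=42
)

print(len(train), len(dev), len(test))  # 28 4 8 with this toy data
```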
### Parameters

Each experiment uses a max input length of 1024 tokens and a max output length of 128 tokens to account for the short average length of quotes.
While quote lengths vary considerably, poignant statements are of the most interest.

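A sketch of how these length limits might be applied when tokenizing context/quote pairs; the preprocessing code is not part of this card, so the variable names and example pair are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

# Toy context/quote pair standing in for a real training example.
contexts = ["He paused at the door and delivered the line everyone still remembers."]
quotes = ["Frankly, my dear, I don't give a damn."]

# Contexts (inputs) are truncated to 1024 tokens, quotes (targets) to 128 tokens.
batch = tokenizer(contexts, max_length=1024, truncation=True)
labels = tokenizer(text_target=quotes, max_length=128, truncation=True)
batch["labels"] = labels["input_ids"]
```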
Each BART model is trained with a batch size of 32 for 30 epochs (7440 steps) using AdamW with 0.01 weight decay and a linearly decaying learning rate peaking at 5e-5.
The first 5% of steps, i.e., 1.5 epochs, are used for a linear warmup. The model is evaluated every 500 steps w.r.t. ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum.
After training, the checkpoint with the best eval_rougeL is loaded to prefer extractive over abstractive summarization. FP16 mixed precision is used.

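These hyperparameters map roughly onto Hugging Face `Seq2SeqTrainingArguments` as sketched below; this is a reconstruction for reference, not the original training script, and the argument names assume a recent `transformers` release:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-large-quotes",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=30,
    learning_rate=5e-5,          # peak LR, decayed linearly
    weight_decay=0.01,           # AdamW weight decay
    lr_scheduler_type="linear",
    warmup_ratio=0.05,           # first 5% of steps (~1.5 epochs)
    eval_strategy="steps",       # `evaluation_strategy` on older releases
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",
    predict_with_generate=True,  # needed to compute ROUGE during evaluation
    generation_max_length=128,
    fp16=True,
)
```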
In addition, T5-base \[4\] is trained for comparison with a batch size of 8 (29670 steps), due to its larger memory footprint, and a peak learning rate of 3e-4.

The learning rates were chosen empirically on shorter training runs of 5 epochs.

### Evaluation

Since no data splits were published with the T50 paper \[3\], the results are not fully reproducible, and the models are evaluated on the test split of the previously described data.
Rather than using the whole test set at once for evaluation, it is split into 3 equally-sized disjoint random samples of 753 examples each.
Each model is evaluated on all 3 samples, and the mean and 95% confidence interval of each score are reported below.
Additionally, the table includes the average predicted quote length, the number of epochs of the best training checkpoint, and the total training time.

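A sketch of how the per-subset scores could be computed with the `evaluate` library; the exact confidence-interval formula is not stated in this card, so a t-interval over the three subsets is assumed, and the predictions/references below are placeholders:

```python
import numpy as np
from scipy import stats
import evaluate

rouge = evaluate.load("rouge")

# Placeholder stand-ins for the three disjoint 753-example test subsets.
subsets = [
    (["to be or not to be"], ["To be, or not to be, that is the question."]),
    (["here's looking at you"], ["Here's looking at you, kid."]),
    (["may the force be with you"], ["May the Force be with you."]),
]

# ROUGE-L for each subset (predicted quotes vs. reference quotes).
per_subset = [
    rouge.compute(predictions=preds, references=refs)["rougeL"]
    for preds, refs in subsets
]

# Mean and half-width of an assumed 95% t-interval over the three subsets.
values = np.asarray(per_subset)
half_width = stats.sem(values) * stats.t.ppf(0.975, df=len(values) - 1)
print(f"{values.mean():.4f} ± {half_width:.4f}")
```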
| Checkpoint | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Avg Quote Length | Epochs | Time |
| -------------- | --------------- | --------------- | --------------- | --------------- | ---------------- | ------ | ------- |
| T5-base | 0.3758 ± 0.0175 | 0.2990 ± 0.0128 | 0.3628 ± 0.0189 | 0.3684 ± 0.0201 | 18.1576 ± 0.1084 | 1.01 | 3:39:14 |
| BART-base | 0.4236 ± 0.0133 | 0.3498 ± 0.0116 | 0.4112 ± 0.0135 | 0.4165 ± 0.0107 | 19.1027 ± 0.1755 | 12.10 | 0:44:48 |
| BART-large | 0.4252 ± 0.0240 | 0.3456 ± 0.0204 | 0.4115 ± 0.0206 | 0.4171 ± 0.0209 | 19.2877 ± 0.1819 | 6.05 | 2:43:56 |
| BART-large-cnn | 0.4384 ± 0.0225 | 0.3693 ± 0.0197 | 0.4165 ± 0.0239 | 0.4317 ± 0.0234 | 81.8623 ± 1.5324 | 28.23 | 3:48:24 |

## References

\[1\] [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)

\[2\] [You Had Me at Hello: How Phrasing Affects Memorability](https://aclanthology.org/P12-1094/)

\[3\] [Quote Detection: A New Task and Dataset for NLP](https://aclanthology.org/2023.latechclfl-1.3/)

\[4\] [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)