---
license: cc-by-nc-sa-4.0
pipeline_tag: fill-mask
language: en
tags:
- long_documents
datasets:
- c4
model-index:
- name: kiddothe2b/longformer-mini-1024
  results: []
---
# Longformer / longformer-mini-1024

## Model description
Longformer is a transformer model for long documents. This version of Longformer is presented in An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification (Chalkidis et al., 2022).
The model was warm-started by re-using the weights of a miniature BERT model (Turc et al., 2019) and further pre-trained for MLM following the Longformer paradigm of Beltagy et al. (2020). It supports sequences of up to 1,024 tokens.
Longformer uses a combination of sliding-window (local) attention and global attention. Global attention is user-configured based on the task, so that the model can learn task-specific representations.
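As an illustration, the sketch below shows how global attention is typically configured in the Longformer family of models. The use of a `global_attention_mask` argument is an assumption about this remote-code implementation (it mirrors the standard Longformer API) and may differ from the actual custom modeling code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumption: the remote code follows the standard Longformer convention where
# global attention is passed as a `global_attention_mask` (1 = global, 0 = local).
tokenizer = AutoTokenizer.from_pretrained("kiddothe2b/longformer-mini-1024", trust_remote_code=True)
model = AutoModel.from_pretrained("kiddothe2b/longformer-mini-1024", trust_remote_code=True)

inputs = tokenizer("A very long document ...", return_tensors="pt",
                   truncation=True, max_length=1024)

# Give only the first ([CLS]) token global attention; every other token
# attends through the sliding (local) window.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
```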
## Intended uses & limitations
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole document to make decisions, such as document classification, sequential sentence classification or question answering.
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import pipeline

mlm_model = pipeline('fill-mask', model='kiddothe2b/longformer-mini-1024', trust_remote_code=True)
mlm_model("Hello I'm a <mask> model.")
```
You can also fine-tune it for SequenceClassification, SequentialSentenceClassification, and MultipleChoice downstream tasks:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("kiddothe2b/longformer-mini-1024", trust_remote_code=True)
doc_classifier = AutoModelForSequenceClassification.from_pretrained("kiddothe2b/longformer-mini-1024", trust_remote_code=True)
```
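Continuing from the snippet above, a rough usage sketch follows; the input string is a placeholder for a real long document, and the classification head is randomly initialised until the model is fine-tuned:

```python
import torch

# Placeholder input; in practice this would be a document of up to 1,024 subwords.
inputs = tokenizer("A very long document ...", return_tensors="pt",
                   truncation=True, max_length=1024)

with torch.no_grad():
    logits = doc_classifier(**inputs).logits

# The predicted class index is only meaningful after fine-tuning on labelled data.
predicted_class = logits.argmax(dim=-1).item()
```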
### Limitations and bias
The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. Therefore, the model can have biased predictions.
## Training procedure

### Training and evaluation data
The model was warm-started from the google/bert_uncased_L-6_H-256_A-4 checkpoint and further pre-trained for an additional 50k steps on English Wikipedia.
### Training hyperparameters
The following hyperparameters were used during training (see the sketch after this list for how they map onto `TrainingArguments`):
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- training_steps: 50000
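For reference, here is a rough sketch of how these settings map onto `transformers.TrainingArguments`. The actual pre-training script is not reproduced here, so treat the output directory and the exact argument mapping as illustrative.

```python
from transformers import TrainingArguments

# Illustrative mapping of the hyperparameters listed above; the real
# pre-training setup may have used a different script or configuration.
training_args = TrainingArguments(
    output_dir="longformer-mini-1024-mlm",  # hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=4,          # 32 x 4 = 128 effective batch size
    max_steps=50_000,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
)
```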
### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|
| 1.7067        | 0.2   | 10000 | 1.5923          | 0.6714   |
| 1.6532        | 0.4   | 20000 | 1.5494          | 0.6784   |
| 1.622         | 0.6   | 30000 | 1.5208          | 0.6830   |
| 1.588         | 0.8   | 40000 | 1.4880          | 0.6876   |
| 1.5682        | 1.0   | 50000 | 1.4680          | 0.6908   |
### Framework versions
- Transformers 4.19.0.dev0
- PyTorch 1.11.0
- Datasets 2.0.0
- Tokenizers 0.11.6
## Citing
If you use this Longformer model in your research, please cite An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification, alongside Longformer: The Long-Document Transformer.
```bibtex
@misc{chalkidis-etal-2022-hat,
  url = {https://arxiv.org/abs/xxx},
  author = {Chalkidis, Ilias and Dai, Xiang and Fergadiotis, Manos and Malakasiotis, Prodromos and Elliott, Desmond},
  title = {An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification},
  publisher = {arXiv},
  year = {2022},
}

@article{Beltagy2020Longformer,
  title = {Longformer: The Long-Document Transformer},
  author = {Iz Beltagy and Matthew E. Peters and Arman Cohan},
  journal = {arXiv:2004.05150},
  year = {2020},
}
```