metadata

tags:
  - generated_from_trainer
datasets:
  - RaiBP/openwebtext2-first-30-chunks-ablation-full
model-index:
  - name: training_full
    results: []

training_full

This model was trained from scratch on the RaiBP/openwebtext2-first-30-chunks-ablation-full dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

The following command was used:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_clm.py \
--output_dir="./training_full" \
--model_type="gpt2" \
--config_name="./training" \
--tokenizer_name="./training" \
--dataset_name="RaiBP/openwebtext2-first-30-chunks-ablation-full" \
--do_train \
--per_device_train_batch_size 8 \
--block_size="1024" \
--learning_rate="5e-3" --warmup_steps="1000" \
--adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
--overwrite_output_dir \
--num_train_epochs="1" \
--logging_steps="500" \
--save_steps="5000" --preprocessing_num_workers="16" \
--gradient_accumulation_steps="4" \
--report_to="tensorboard" \
--logging_dir="./log_full"
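As a quick check on the flags above, the number of sequences per optimizer step is the product of the per-device batch size, the number of processes, and the gradient accumulation steps. The sketch below only restates that arithmetic. The variable names are illustrative, and the tokens-per-step figure assumes every training sequence is exactly block_size tokens long.

```python
# Values copied from the command above; variable names are illustrative.
per_device_train_batch_size = 8
num_processes = 2                 # --nproc_per_node=2
gradient_accumulation_steps = 4
block_size = 1024

# Sequences contributing to one optimizer update.
sequences_per_step = (
    per_device_train_batch_size * num_processes * gradient_accumulation_steps
)

# Assumption: every sequence holds exactly block_size tokens.
tokens_per_step = sequences_per_step * block_size

print(sequences_per_step)
print(tokens_per_step)
```

The first value should agree with the total_train_batch_size reported under the training hyperparameters.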

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.005
train_batch_size: 8
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 2
gradient_accumulation_steps: 4
total_train_batch_size: 64
total_eval_batch_size: 16
optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 1000
num_epochs: 1.0
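
To illustrate what a linear scheduler with 1000 warmup steps does to the learning rate, here is a minimal sketch in plain Python. It is not the training code: the function name is hypothetical, and total_steps is a placeholder because this card does not report the number of optimizer steps.

```python
def linear_schedule_lr(step, peak_lr=5e-3, warmup_steps=1000, total_steps=100_000):
    """Learning rate at a given optimizer step.

    total_steps=100_000 is a placeholder, not a value from this card.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr during warmup.
        return peak_lr * step / max(1, warmup_steps)
    # Linear decay from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 500, 1000, 50_500, 100_000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate climbs to the peak of 0.005 at step 1000 and then falls linearly to zero at the final step.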

Training results

Evaluation results

Perplexity on the first 5000 examples of the Wiki-40B test sets, computed with the code provided in the perplexity docs and a stride of 512 tokens:

Target language	PPL
en	33.41002655029297

The following script was used for evaluation:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from tqdm import tqdm

def filter_example(example):
    # remove Wiki-40B section dividers
    example = example.replace("_START_ARTICLE_", "")
    example = example.replace("_START_SECTION_", "")
    example = example.replace("_START_PARAGRAPH_", "")
    return example

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the model
model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-full"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

target_language = "en" # change language to fr, de, es, etc.
test = load_dataset("wiki40b", target_language, split="test")

num_examples = 5000 # how many examples to run the evaluation on
examples = test["text"][:num_examples]
examples = [filter_example(example) for example in examples]
encodings = tokenizer("\n\n".join(examples), return_tensors="pt")

max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

print("Perplexity: ", ppl.item())
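
The window bookkeeping in the loop above can be checked without loading a model. The sketch below reproduces only the index arithmetic. The helper name sliding_windows and the sequence length of 2500 are made up for illustration, and max_length=1024 is an assumption about the model's n_positions.

```python
def sliding_windows(seq_len, max_length, stride):
    """Yield (begin_loc, end_loc, trg_len) exactly as the evaluation loop does."""
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        # Only the last trg_len tokens of the window are scored;
        # the earlier ones serve as context and get label -100.
        trg_len = end_loc - prev_end_loc
        yield begin_loc, end_loc, trg_len
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break


# Hypothetical sequence length; max_length is assumed to be 1024.
windows = list(sliding_windows(seq_len=2500, max_length=1024, stride=512))
for window in windows:
    print(window)

# Every token position is assigned to exactly one window's target span.
print(sum(trg_len for _, _, trg_len in windows))
```

Because the first and last windows score a different number of tokens than the others, the unweighted mean of per-window losses used in the script is an approximation of the token-weighted average.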

Framework versions

Transformers 4.37.0.dev0
Pytorch 1.13.0
Datasets 2.16.0
Tokenizers 0.15.0