Using non-fast tokenizers
Hi, I am trying to run a few tasks from NorBench on gpt-sw3-126m, specifically a sentiment analysis task. I have loaded the model and tokenizer using the suggested code:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
model_name = "AI-Sweden-Models/gpt-sw3-126m"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
prompt = "Träd är fina för att"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
model.to(device)
But when running the script for sentiment analysis, I get the following error:
ValueError: word_ids() is not available when using non-fast tokenizers (e.g. instance of a XxxTokenizerFast class).
Is it correct that the instantiated tokenizer is a slow (Python-based) tokenizer, and therefore does not support word_ids()?
Here is a link to the script: https://github.com/ltgoslo/norbench/blob/main/evaluation_scripts/tsa_finetuning.py
@mysil
As part of the ScandEval framework I had to deal with this issue too. I ended up coding a "manual" version of word_ids, and you can find that implementation here.
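For reference, a manual word_ids can be approximated for SentencePiece-style tokenizers (like the one GPT-SW3 uses), where a leading "▁" marks the start of a new word. This is an illustrative sketch under that assumption, not the exact ScandEval implementation; the function name and the special-token list are placeholders:

```python
def manual_word_ids(tokens, special_tokens=("<s>", "</s>", "<pad>", "<unk>")):
    """Assign a word index to each token, mimicking BatchEncoding.word_ids().

    Assumes a SentencePiece-style tokenizer where a leading "▁" marks the
    start of a new word; special tokens map to None, matching the fast
    tokenizers' behavior.
    """
    word_ids = []
    current = -1  # index of the word the current token belongs to
    for tok in tokens:
        if tok in special_tokens:
            # Special tokens belong to no word, just like with fast tokenizers.
            word_ids.append(None)
        else:
            # A "▁" prefix (or the very first subword) starts a new word.
            if tok.startswith("▁") or current == -1:
                current += 1
            word_ids.append(current)
    return word_ids


tokens = ["<s>", "▁Träd", "▁är", "▁fin", "a", "▁för", "▁att", "</s>"]
print(manual_word_ids(tokens))  # [None, 0, 1, 2, 2, 3, 4, None]
```

In practice you would feed it the output of tokenizer.convert_ids_to_tokens(...) on your encoded input; note that this heuristic only holds for tokenizers using the "▁" word-boundary convention.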