Tokenization/decoding bug with "_"?
#13, opened by anttip
The model sometimes outputs broken syntax: segments starting with "_" are replaced with special tokens. I first noticed this with a sequence containing "_to". Now I get this generated:
df = pd.read<s> = pd.read_csv(
This happens with both transformers 4.33.1 and the current development version, transformers-4.34.0.dev0.
The issue could be with bitsandbytes quantization: adding bnb_4bit_quant_type="nf4" to the bitsandbytes config fixes the example above.
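
For anyone trying the same workaround, here is a minimal sketch of what that config change might look like. It assumes the model is loaded in 4-bit mode (load_in_4bit=True), which the post does not state explicitly. The helper names nf4_quant_kwargs and load_quantized and the model_id argument are hypothetical placeholders, not taken from the original setup.

```python
def nf4_quant_kwargs():
    """Keyword arguments for a 4-bit BitsAndBytesConfig using the NF4 quant type."""
    return {
        # Assumption: the model is loaded in 4-bit mode.
        "load_in_4bit": True,
        # If this is left out, bnb_4bit_quant_type falls back to its default, "fp4".
        "bnb_4bit_quant_type": "nf4",
    }


def load_quantized(model_id):
    """Load a causal LM with NF4 quantization; model_id is a hypothetical placeholder."""
    # Imported lazily so nf4_quant_kwargs() can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(**nf4_quant_kwargs())
    return AutoModelForCausalLM.from_pretrained(model_id, quantization_config=config)
```

Calling load_quantized with the model's repository id should then return the model quantized with NF4 rather than the default FP4. Whether this avoids the "_" problem in every case is not confirmed here; it only reproduces the change that fixed the example above.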