Space before EOS

#3
by etemiz - opened

The EOS token seems to need a space before it. Without the space, "</s>" is split into ordinary tokens instead of being encoded as the single id 2. Why?

from transformers import AutoTokenizer

model_id = "AlexWortega/miqu-1-70b-AQLM-2Bit-1x16-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Special tokens and how each one encodes on its own
print(tokenizer.special_tokens_map)
print(tokenizer.bos_token_id, "<s>", tokenizer.encode("<s>"))
print(tokenizer.eos_token_id, "</s>", tokenizer.encode("</s>"))

# Same text with and without a space before "</s>"
print("<s>hello</s>", tokenizer.encode("<s>hello</s>"))
print("<s>hello </s>", tokenizer.encode("<s>hello </s>"))
print("<s>[INST] hello [/INST] hi</s>", tokenizer.encode("<s>[INST] hello [/INST] hi</s>"))
print("<s>[INST] hello [/INST] hi </s>", tokenizer.encode("<s>[INST] hello [/INST] hi </s>"))

Output:

{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>'}
1 <s> [1, 1]
2 </s> [1, 2]
<s>hello</s> [1, 1, 12199, 829, 29879, 29958]
<s>hello </s> [1, 1, 12199, 2]
<s>[INST] hello [/INST] hi</s> [1, 1, 29961, 25580, 29962, 22172, 518, 29914, 25580, 29962, 7251, 829, 29879, 29958]
<s>[INST] hello [/INST] hi </s> [1, 1, 29961, 25580, 29962, 22172, 518, 29914, 25580, 29962, 7251, 2]
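
To make the difference easy to spot, here is a minimal sketch that works only on the id lists printed above, so it needs no model download. The helper names `ends_with_eos` and `append_eos` are hypothetical and not part of transformers. Appending the EOS id directly is one possible way to sidestep the string-level parsing of "</s>", assuming the text is encoded without a trailing "</s>".

```python
from typing import List

EOS_ID = 2  # eos_token_id as printed in the output above

# Id lists copied from the output above
no_space = [1, 1, 12199, 829, 29879, 29958]  # "<s>hello</s>"
with_space = [1, 1, 12199, 2]                # "<s>hello </s>"


def ends_with_eos(ids: List[int], eos_id: int = EOS_ID) -> bool:
    """Return True if the last id in the sequence is the EOS id."""
    return len(ids) > 0 and ids[-1] == eos_id


def append_eos(ids: List[int], eos_id: int = EOS_ID) -> List[int]:
    """Return a copy of ids that ends with the EOS id, adding it only if missing."""
    if ends_with_eos(ids, eos_id):
        return list(ids)
    return list(ids) + [eos_id]


print(ends_with_eos(no_space))    # False: "</s>" was split into 829, 29879, 29958
print(ends_with_eos(with_space))  # True: "</s>" became the single id 2

# Hypothetical workaround: take ids for text that has no trailing "</s>"
# and append the EOS id by hand instead of relying on string parsing.
text_ids = with_space[:-1]        # [1, 1, 12199]
print(append_eos(text_ids))       # [1, 1, 12199, 2]
```

This only detects and patches the symptom at the id level; it does not explain why the tokenizer needs the space.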
