Special tokens in output generation

by Matthieu - opened May 9, 2023

Discussion

Matthieu

May 9, 2023

edited May 9, 2023

Hello,

Thanks for sharing this model!

When generating output, even with "skip_special_tokens=True", there are two special tokens at the beginning ( ) and end (\n) of the output, in addition to special whitespace tokens.
Is there any way to remove them and to use a regular space token instead of the special whitespace tokens?

DachengLi

May 9, 2023

Thanks a lot for trying the model! Can you try using T5Tokenizer instead of AutoTokenizer, and pass spaces_between_special_tokens=False when decoding?
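
For reference, a minimal sketch of that decoding call. The helper name decode_generation and the MODEL_ID placeholder in the usage comment are assumptions for illustration and do not come from this thread; the helper only forwards the two keyword arguments discussed here to the tokenizer's decode method.

```python
def decode_generation(tokenizer, token_ids):
    """Decode generated ids without special tokens.

    `decode_generation` is a hypothetical helper name. It passes the two
    flags discussed in this thread to the tokenizer's `decode` method.
    """
    return tokenizer.decode(
        token_ids,
        skip_special_tokens=True,
        spaces_between_special_tokens=False,
    )


# Usage sketch (needs `transformers` and `sentencepiece` installed;
# MODEL_ID is a placeholder for the model repository id):
#
#   from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
#   tokenizer = T5Tokenizer.from_pretrained(MODEL_ID)
#   model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
#   inputs = tokenizer("Hello", return_tensors="pt")
#   output_ids = model.generate(**inputs, max_new_tokens=32)[0]
#   print(repr(decode_generation(tokenizer, output_ids)))
```

Printing with repr makes any leftover leading or trailing whitespace, such as a final \n, visible in the output.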

Matthieu

May 12, 2023

Thanks for your feedback! I have applied all your recommendations, but I still get a newline character (\n) at the end of the generated output.

Any idea?

DachengLi

May 12, 2023

Hi,
Can you take a screenshot of the problem (input, tokenized input, decoded output, etc.) so that we can walk through it? By the way, here is a question we got on GitHub that seems pretty similar: https://github.com/lm-sys/FastChat/issues/1022. Maybe you can take a look at it as well?
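
As an alternative to a screenshot, the same details could be collected as text. The sketch below is only one way to do it; describe_generation is a hypothetical helper name, and it assumes a Hugging Face tokenizer object exposing tokenize, convert_ids_to_tokens and decode.

```python
def describe_generation(tokenizer, prompt, output_ids):
    """Gather the input, its tokens, the generated tokens, and the decoded text.

    Hypothetical helper for debugging; `tokenizer` is assumed to be a
    Hugging Face tokenizer and `output_ids` a 1-D sequence of token ids.
    """
    ids = [int(i) for i in output_ids]  # accepts plain ints or tensor elements
    return {
        "prompt": prompt,
        "prompt_tokens": tokenizer.tokenize(prompt),
        "output_ids": ids,
        "output_tokens": tokenizer.convert_ids_to_tokens(ids),
        # repr() keeps characters such as "\n" visible in the report
        "decoded": repr(tokenizer.decode(ids, skip_special_tokens=True)),
    }
```

Pasting the returned dictionary into a reply shows which generated token produces the trailing newline.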
