Perfect upload. Stops the conversation when finished.
These updated files work fine.
I use Meta-Llama-3-8B-Instruct.Q8_0.gguf and Meta-Llama-3-8B-Instruct.Q6_K.gguf, and both stop the conversation properly when finished.
Many thanks. :)
@0-hero Could you tell us how you made the current GGUFs? They work well, and the models stop their turn as they should - but when I tried to reproduce the conversion with convert-hf-to-gguf.py from current llama.cpp (b2709), which in theory supports Llama 3, I got GGUFs that just don't stop generating. Did you change any of the tokenizer configuration files vs. the original Llama repo, and/or use a specific llama.cpp commit / PR?
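For reference, what I tried was roughly the following (a sketch from my setup; paths and flags may differ from whatever you used):

```sh
# Convert the HF checkpoint to an f16 GGUF with llama.cpp b2709
python convert-hf-to-gguf.py ./Meta-Llama-3-8B-Instruct \
    --outtype f16 --outfile Meta-Llama-3-8B-Instruct-f16.gguf

# Quantize to Q8_0 (same procedure for Q6_K)
./quantize Meta-Llama-3-8B-Instruct-f16.gguf \
    Meta-Llama-3-8B-Instruct.Q8_0.gguf Q8_0
```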
I think the changes are already merged.
@0-hero
I pulled the current configs from the original repo, but GGUFs made with llama.cpp b2709 still didn't stop the generation. So I experimented a bit and changed `"eos_token": "<|end_of_text|>"` to `"eos_token": "<|eot_id|>"` in tokenizer_config.json, and that finally made the generation stop in Ollama after the model's turn (like in those GGUFs of yours). But I'm not sure if that's the best / proper way, or whether it will have side effects in other apps. Always something with the tokenizer and/or the chat template, sigh...
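For anyone else hitting this, the change as a diff (only this one line of tokenizer_config.json is touched):

```diff
--- tokenizer_config.json (original)
+++ tokenizer_config.json (modified)
-  "eos_token": "<|end_of_text|>",
+  "eos_token": "<|eot_id|>",
```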