Still no GGUF uploads?
It has been 3 days since the repo was created, but there are still no actual GGUFs available. Is there something wrong with the chat template, or some other bug? From what I see on LM Studio, using @Bartowski's quant, there appears to be a chat template issue, and after manually copying the chat template over, there seems to be a serious repetition issue.
Are you guys experiencing something similar, and thus choosing not to release any GGUFs yet?
About the repetition issue: did you try presence_penalty 1.5? They suggest that on their model card, and for me it solved the issue.
I'm closely watching here too, but it seems other models got priority in their queue 🥹 Patience ;)
I would love to see this one eventually.
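For anyone wondering why the suggested presence penalty helps against repetition: it subtracts a fixed amount from the logit of every token that has already appeared in the output, so a token the model keeps emitting becomes progressively less attractive relative to fresh ones. Below is a minimal sketch of that mechanism in the OpenAI-style formulation; the function name and the toy logits are illustrative, not llama.cpp internals (llama.cpp's `--presence-penalty` behaves analogously but operates on its own sampler chain).

```python
def apply_presence_penalty(logits, generated_tokens, penalty):
    """Subtract `penalty` from the logit of every token id that has
    already appeared at least once in the output so far."""
    seen = set(generated_tokens)
    return [l - penalty if tok in seen else l
            for tok, l in enumerate(logits)]

# Toy vocabulary of 4 token ids; token 2 (say, "same") was already emitted.
logits = [1.0, 0.5, 3.0, 0.2]
penalized = apply_presence_penalty(logits, generated_tokens=[2, 2, 2], penalty=1.5)
print(penalized)  # token 2 drops from 3.0 to 1.5 and no longer dominates
```

With penalty 1.5, the runaway token's logit falls from 3.0 to 1.5, which is why a collapse like the one quoted below gets suppressed once the penalty is enabled.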
@mingyi456 I think I observe the same thing. Here is what it looks like in my case:
On long outputs (note that the number of tokens at which it happens differs between (1) and (2), but the order of magnitude stays roughly the same), it suddenly either:
- collapses into repeating a single token (note below that it actually switched from "same" to "time"), and never recovers (1)
- stops outputting anything more, while the GPU is still computing (2)
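As an aside, failure mode (1) is easy to spot programmatically when scanning transcripts. This is a hypothetical helper of my own (not part of llama.cpp) that flags a tail of identical tokens; the window size is an arbitrary choice:

```python
def is_collapsed(tokens, window=20):
    """True if the last `window` tokens are all identical,
    i.e. the output has degenerated into a single repeated token."""
    if len(tokens) < window:
        return False
    return len(set(tokens[-window:])) == 1

print(is_collapsed(["the", "fridge", "is"] + ["same"] * 20))  # True
print(is_collapsed(["the", "fridge", "is", "cold"]))          # False
```

A strict single-token window would miss the moment the collapse switches tokens (as "same" → "time" does below), so in practice you would check distinct-token counts over the tail rather than exact equality.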
Command:
# model: official 4.0.1 GGUF from LG, but the issue was there in the official 4.0 too
# chat template: the official, last-updated one
./llama-cli -m /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/llama-cpp/GGUFS/exaone-4.0.1-32b-q4_k_m.gguf \
--presence-penalty 1.5 \
-c 56000 \
-fa \
-ngl 65 \
--temp 0.6 \
--top-p 0.95 \
--jinja \
--chat-template-file ~/exaone4.jinja
(1)
The target temperature is ~5°C. And we consider that :
- if the fridge is empty, the air inside is already at the target temperature
- if the fridge is full, all the food is already at the target temperature
Key Factors:
- Thermal Mass and Temperature Stability:
- In a full fridge, the same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same time time time time time time time time time time time time time time time time time time
llama_perf_context_print: total time = 130320,94 ms / 4497 tokens
(2)
If the thermostat has a deadband, say it turns on at 6°C and off at 4°C for both, then the average temperature is the same, around 5°C.
But for empty fridge, since C is small, when heat leaks in, temperature rises quickly to 6°C, so the off-time is short, but on-time might be short too to cool it down.
Similarly for full, temperature rises slowly to 6°C, taking longer time, so off-time is
llama_perf_context_print: total time = 144432,42 ms / 4789 tokens
I'm going to run a few more tests without flash attention enabled (dropping -fa), just to see; I'll report back here.