Repetition/collapse issue

#1
by owao - opened

(from https://huggingface.co/unsloth/EXAONE-4.0-32B-GGUF/discussions/1)

Here is what I have been observing:

On long outputs (the number of tokens at which it happens differs between (1) and (2), but the order of magnitude stays roughly the same), it suddenly either:

  • collapses into repeating a single token (note below that it actually switched from "same" to "time") and never recovers (1)
  • stops outputting anything more, while the GPU keeps computing (2)

Command:

# model: the official 4.0.1 GGUF from LG, but the issue was there in the official 4.0 too
# chat template: the official, last updated one
./llama-cli -m /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/llama-cpp/GGUFS/exaone-4.0.1-32b-q4_k_m.gguf \
            --presence_penalty 1.5 \
            -c 56000 \
            -fa \
            -ngl 65 \
            --temp 0.6 \
            --top-p 0.95 \
            --jinja \
            --chat-template-file ~/exaone4.jinja

(1)

The target temperature is ~5°C. And we consider that :

  • if the fridge is empty, the air inside is already at the target temperature
  • if the fridge is full, all the food is already at the target temperature

Key Factors:

  1. Thermal Mass and Temperature Stability:
    • In a full fridge, the same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same same time time time time time time time time time time time time time time time time time time
llama_perf_context_print:       total time =  130320,94 ms /  4497 tokens

(2)

If the thermostat has a deadband, say it turns on at 6°C and off at 4°C for both, then the average temperature is the same, around 5°C.

But for empty fridge, since C is small, when heat leaks in, temperature rises quickly to 6°C, so the off-time is short, but on-time might be short too to cool it down.

Similarly for full, temperature rises slowly to 6°C, taking longer time, so off-time is

llama_perf_context_print:       total time =  144432,42 ms /  4789 tokens

I'm going to run a few more tests without flash attention enabled, just to see; I'll report back here.

LG AI Research org

Hello @owao. Thank you for your attention and contribution!

To help us better understand the issue, could you share the input and the output you tested?

I didn't mention it before, but I have the exact same issue when running llama-server instead of llama-cli.
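
In case it's useful, the server runs used essentially the same settings; here is a rough sketch (not my exact command, host/port are placeholders, and sampling parameters can also be passed per request):

./llama-server -m /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/llama-cpp/GGUFS/exaone-4.0.1-32b-q4_k_m.gguf \
               -c 56000 \
               -fa \
               -ngl 65 \
               --jinja \
               --chat-template-file ~/exaone4.jinja \
               --host 127.0.0.1 --port 8080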

Here is the llama.cpp build version:

version: 6008 (f1d4698f)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
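
(For reference, that build string is what the llama.cpp binaries print with the --version flag, e.g.:

./llama-cli --version
)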

Here is the full output for (1): https://paste.nomagic.uk/?8e00abca55a2bace#vRx4xdKp5En89XfKxdNYEruH4SfKmWvukvXd4zZtYyK
And for (2): https://paste.nomagic.uk/?2514b7bcc4b5d528#EyfdfSBz6w6EijPnR4MZ8EN2r2d1nhfKt1cSsephtzNX

Don't hesitate to ask for any other info you might need, as I would really like to give this model a try as a daily driver :)

Without flash attention, it seems both issues are gone.
I only tried 4 generations, because I'm running it on my own hardware and each one takes a while, but I think this is worth experimenting with further.
I'll provide a follow-up if I can confirm the issue is resolved long-term.
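
Concretely, those runs used the same command as above with only the -fa flag dropped, something like:

./llama-cli -m /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/llama-cpp/GGUFS/exaone-4.0.1-32b-q4_k_m.gguf \
            --presence_penalty 1.5 \
            -c 56000 \
            -ngl 65 \
            --temp 0.6 \
            --top-p 0.95 \
            --jinja \
            --chat-template-file ~/exaone4.jinja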
