Collapses coherence almost immediately

#1
by ramendik - opened

Unfortunately this quant becomes incoherent for me almost immediately on both CPU and NPU. Example response to a prompt of "test":

Alright, I need to test something. Let me think about what I want to test. Maybe it's a new feature I heard about, or a bug I remember. Okay, if I remember correctly, let me start with that.

First, I should make sure I know what I'm testing is supposed to be. If it's an existing feature, I might have heard about it, I don't know yet, so I should check the documentation or maybe talk to the team. if it's something new, I will get the latest update.

test new features. Hmm, if I want to test something new, I might also want to make sure it's stable before using it. new features. test can help ensure that all the tests pass, and if I don't know what the new features are, I can try to see the latest release notes or maybe talk to the developers about the new features.

I'm not sure, maybe it's something related to the current features I'm using, but I don't know for sure. in that case, if I can try to think of the last time I used a new feature, what was bad, maybe something didnsomething I do not want to repeat now.

bad features I might have tried before, if I dontest something, if test I something else I could think of, if I test something, maybe think about the last time I used a new feature, test can help ensure that all the tests pass, and if I don some 3 1 2 1 something something 1 something 1 something 1 something 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

OpenVINO Toolkit org
edited 28 days ago

Hi @ramendik . Thank you for reporting. What code do you use? On CPU using this code:

import openvino_genai as ov_genai

device = "CPU"
pipe = ov_genai.LLMPipeline(model_path, device)
print(pipe.generate("test", max_length=200))

I've got:
Alright, so I came across this problem that says, "test," and I'm supposed to figure it out. Hmm, at first glance, it's a bit unclear. Maybe it's a typo or something is missing. Let me think about possible directions this could be going.

First, could "test" refer to a type of test? Like, maybe it's a scientific test or something? Or is it a software test? I'm not sure. Maybe I should consider the context where I found "test." If it's in a math problem, maybe it's about testing something, like a theory or a hypothesis.

Wait, another possibility is that "test" is part of a word or a sentence, but it's incomplete. Like, maybe it's the start of a word, and I need to finish it? For example, "testing something." But without more, it's hard to tell.

Alternatively, could "test" be a name of

I was on OpenVINO Model Server, the ready-to-roll docker image (I used that because the host system is Fedora and does not have the NPU userland - I just passed /dev/accel through to the container; the aim of the testing was to evaluate the NPU). I don't rememer if this particular log was the CPU or NPU run, but I got rambling and increasing incoherence on both CPU and NPU; I was not yet passing GPU to the container at that point.

However you have a stricter max_length - I'd suggest retrying a few times with a higher max_length as the simplest adjustment that might reproduce.

OpenVINO Toolkit org

Hi @ramendik ,
thanks for reaching out here.
I tried using this model on NPU (Intel Core Ultra 7 258V) and don't see any issue with generated text :
"
import openvino_genai as ov_genai

device = "NPU"
pipe = ov_genai.LLMPipeline("DeepSeek-R1-Distill-Qwen-7B-int4-cw-ov", device)
print(pipe.generate("test", max_length=1024))

Okay, I need to figure out how to approach the user's request. They wrote "test". Hmm, that's a bit vague. I should consider what they might be testing. Maybe they're testing if I understand them or if I can help them better., since the context isn't clear, I should provide a general response. "Testing" could refer to a few things, like testing a product, testing a skill, or even testing something in a specific context like a programming language or a scientific experiment. I should also be ready to ask them to clarify their request if I don't want to misunderstand them.
Alright, I'll make sure to respond in a friendly and understanding manner, letting them know I'm here to help.
Also, I'll keep it concise and clear, avoiding any unnecessary jargon.

It seems like you might be testing something, please provide more details so I can better assist you!
"

Please make sure to use the latest OpenVINO version (2025.3) and install the latest NPU driver as written here https://docs.openvino.ai/2025/get-started/install-openvino/configurations/configurations-intel-npu.html

Sign up or log in to comment