NPU performance

#1
by Echo9Zulu - opened

Hi!

A few questions:

Did you use openvino-genai or optimum-intel?

How much context were you able to run?

Can you share throughput results?

Thanks for the upload! Check out my repo for even more OpenVINO quants.

It should work with openvino-genai and optimum-intel, but you need to make sure the NPU driver is updated and you're using openvino 2025.2.0. We only tested with openvino-genai.

pip install openvino==2025.2.0
pip install optimum[openvino]
pip install openvino-genai

NPU driver version required: 32.0.100.4023
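
A quick sanity check that the device and driver are actually visible to OpenVINO (plain openvino, nothing genai-specific):

import openvino as ov

core = ov.Core()
print(core.available_devices)  # should include "NPU" once the driver is installed
print(core.get_property("NPU", "FULL_DEVICE_NAME"))  # reports the detected NPU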

Thanks for sharing all your models! Have you tested Gemma on NPU? We've been meaning to convert them as well but haven't gotten around to it yet.
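
For when we do get to it, the conversion itself should just be the usual optimum-cli export; a sketch (the model ID and int4 weight format here are illustrative, not something we've tested):

optimum-cli export openvino --model google/gemma-3-4b-it --weight-format int4 gemma-3-4b-it-int4-ov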

Context-wise, we tested up to 8k, but the results weren't great. We ended up going with the https://huggingface.co/bweng/qwen3-8b-int4-ov-npu model instead.

You probably know this, but in openvino-genai you can increase the input context with MAX_PROMPT_LEN:

import openvino_genai as ov_genai

model_path = "qwen3-8b-int4-ov-npu"  # placeholder: local directory with the exported OpenVINO model
device = "NPU"
pipe = ov_genai.LLMPipeline(model_path, device, MAX_PROMPT_LEN=4096, CACHE_DIR="./cache")
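
And since throughput came up: generate() returns perf metrics you can read directly. A minimal sketch, assuming the pipeline above and going from the genai benchmark samples (the prompt and token budget are placeholders):

result = pipe.generate("Explain what an NPU does.", max_new_tokens=128)
print(result)
metrics = result.perf_metrics  # populated by openvino-genai during generation
print(f"TTFT: {metrics.get_ttft().mean:.1f} ms")
print(f"Throughput: {metrics.get_throughput().mean:.1f} tokens/s")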

https://huggingface.co/bweng/phi-4-mini-instruct-int4-ov-npu

I do not have an NPU device lol. There have been many people over at OpenArc who are interested in NPU but are usually looking for ready-made support as opposed to working in code like you have been.

Haven't tested Gemma on NPU but I do have some converted models; maybe start with https://huggingface.co/Echo9Zulu/gemma-3-4b-it-int8_asym-ov on CPU/iGPU.

No problem! I did a deep dive and haven't stopped. I would like to document that we have NPU support, but I haven't been able to test it; the OpenArc API would need minor adjustments to use NPU, although it might be a bit slower than genai. At a workshop in May, the devs said the speed difference between optimum-intel and genai comes down to dependencies, which was interesting.

Definitely check out the OpenArc Discord; there are a bunch of people actively working with Intel devices across the stack, so it's been a great resource.

Thanks! Just joined the server, I will ping you there. We have a couple NPU devices if you want to collaborate on something :)
