NPU performance

#1
by Echo9Zulu - opened

Hi!

A few questions:

Did you use openvino-genai or optimum-intel?

How much context were you able to run?

Can you share throughput results?

Thanks for the upload! Check out my repo for even more OpenVINO quants.

It should work with openvino-genai and optimum-intel, but you need to make sure the NPU driver is updated and you're using openvino 2025.2.0. We only tested with openvino-genai.

pip install openvino==2025.2.0
pip install optimum[openvino]
pip install openvino-genai

NPU driver version required: 32.0.100.4023
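
A quick sanity check that the device and driver are actually visible to OpenVINO (plain openvino, nothing genai-specific):

import openvino as ov

core = ov.Core()
print(core.available_devices)  # should include "NPU" once the driver is installed
print(core.get_property("NPU", "FULL_DEVICE_NAME"))  # reports the detected NPU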

Thanks for sharing all your models! Have you tested Gemma on NPU? We've been meaning to convert them as well but haven't gotten around to it yet.
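
For when we do get to it, the conversion itself should just be the usual optimum-cli export; a sketch (the model ID and int4 weight format here are illustrative, not something we've tested):

optimum-cli export openvino --model google/gemma-3-4b-it --weight-format int4 gemma-3-4b-it-int4-ov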

Context-wise, we tested up to 8k, but the results weren't great. We ended up going with the https://huggingface.co/bweng/qwen3-8b-int4-ov-npu model instead.

You probably know this, but in openvino-genai you can increase the input context with MAX_PROMPT_LEN:

import openvino_genai as ov_genai

model_path = "qwen3-8b-int4-ov-npu"  # placeholder: local directory with the exported OpenVINO model
device = "NPU"
pipe = ov_genai.LLMPipeline(model_path, device, MAX_PROMPT_LEN=4096, CACHE_DIR="./cache")
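
And since throughput came up: generate() returns perf metrics you can read directly. A minimal sketch, assuming the pipeline above and going from the genai benchmark samples (the prompt and token budget are placeholders):

result = pipe.generate("Explain what an NPU does.", max_new_tokens=128)
print(result)
metrics = result.perf_metrics  # populated by openvino-genai during generation
print(f"TTFT: {metrics.get_ttft().mean:.1f} ms")
print(f"Throughput: {metrics.get_throughput().mean:.1f} tokens/s")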

https://huggingface.co/bweng/phi-4-mini-instruct-int4-ov-npu

I do not have an NPU device lol. There have been many people over at OpenArc who are interested in NPU but are usually looking for ready-made support as opposed to working in code like you have been.

Haven't tested Gemma on NPU but I do have some converted models; maybe start with https://huggingface.co/Echo9Zulu/gemma-3-4b-it-int8_asym-ov on CPU/iGPU.

No problem! I did a deep dive and haven't stopped. I would like to document that we have NPU support, but I haven't been able to test it; the OpenArc API would need minor adjustments to use NPU, although it might be a bit slower than genai. At a workshop in May, the devs said the speed difference between optimum-intel and genai comes down to dependencies, which was interesting.

Definitely check out the OpenArc Discord; there are a bunch of people actively working with Intel devices across the stack, so it's been a great resource.

Thanks! Just joined the server, I will ping you there. We have a couple NPU devices if you want to collaborate on something :)
