How to run an ONNX model in Python code?
Good morning, I downloaded your fp16 ONNX model to run in Python, but I can't get it to run with onnxruntime. How do I run it?
this particular model isn't super straightforward to use.
for this model you'll have to provide the initial KV cache values yourself.
get your model inputs like normal (your input_ids and attention_mask).
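if you need the tokenization step too, here's a minimal sketch (assuming the Hugging Face transformers tokenizer; the model path is a placeholder):

```python
from transformers import AutoTokenizer

# placeholder path -- use the tokenizer that ships with the model
tokenizer = AutoTokenizer.from_pretrained("<model path>")

texts = ["example sentence one", "example sentence two"]
encoded = tokenizer(texts, padding=True, return_tensors="np")
input_ids = encoded["input_ids"].astype("int64")        # (batch, seq_len)
attention_mask = encoded["attention_mask"].astype("int64")
```

then do something like this: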
```python
import onnxruntime as rt
import numpy as np

sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
# onnx_path: path to the downloaded .onnx file
session = rt.InferenceSession(onnx_path, providers=["CPUExecutionProvider"], sess_options=sess_options)

# model dimensions for this checkpoint
NUM_LAYERS = 28
NUM_KV_HEADS = 8
HEAD_DIM = 128
DTYPE = np.float32

sequence_length = input_ids.shape[1]
BATCH_SIZE = input_ids.shape[0]

# position ids 0..seq_len-1, repeated for every row in the batch
position_ids = np.tile(np.arange(sequence_length, dtype=np.int64), (BATCH_SIZE, 1))

# empty KV cache: the sequence dimension is 0 because there is no past yet
kv_shape = (BATCH_SIZE, NUM_KV_HEADS, 0, HEAD_DIM)
empty_kv_tensor = np.zeros(kv_shape, dtype=DTYPE)

past_key_values_inputs = {}
for i in range(NUM_LAYERS):
    past_key_values_inputs[f"past_key_values.{i}.key"] = empty_kv_tensor
    past_key_values_inputs[f"past_key_values.{i}.value"] = empty_kv_tensor

embeddings = session.run(
    ["last_hidden_state"],
    {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    } | past_key_values_inputs,  # dict union needs Python 3.9+
)[0]
```
and then do your last-token pooling, normalization, cosine comparison, etc.
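the pooling part could look something like this (a sketch assuming right-padded batches; with left padding you'd just take index -1):

```python
import numpy as np

# last-token pooling: hidden state of the final non-padding token per row
last_token_idx = attention_mask.sum(axis=1) - 1                      # (batch,)
pooled = embeddings[np.arange(embeddings.shape[0]), last_token_idx]  # (batch, hidden)

# L2-normalize so dot products become cosine similarities
pooled = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# pairwise cosine similarity matrix for the batch
cosine_sim = pooled @ pooled.T
```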
Trying to get this running in Elixir w/ Ortex & the Rust ort bindings, and I'm not finding a way to add these KV cache keys. Is it possible to have an ONNX model that doesn't use those, or is that crazy hard?
it's a little bit tricky creating the initial KV values since you have to match the dimensions of your inputs if you're doing any batching.
use optimum-cli on the safetensors model with a command line like this, with the feature-extraction or sentence-similarity task, and you'll get a simpler model than this one with only input_ids/attention_mask as inputs:

```bash
optimum-cli export onnx -m <model path> <output path> --task feature-extraction
```
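running that export is then straightforward with the same tokenizer outputs as above (a sketch; assumes the default model.onnx filename and last_hidden_state output name from the feature-extraction export):

```python
import onnxruntime as rt

session = rt.InferenceSession("<output path>/model.onnx", providers=["CPUExecutionProvider"])

# no KV cache inputs needed -- just the tokenizer outputs
embeddings = session.run(
    ["last_hidden_state"],
    {"input_ids": input_ids, "attention_mask": attention_mask},
)[0]
```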
if you want uint8 output and an easier-to-use model, you can check out mine: https://huggingface.co/electroglyph/Qwen3-Embedding-0.6B-onnx-uint8
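one thing to watch with uint8 outputs: dequantize (or at least subtract the zero point) before a cosine compare. a generic sketch, with hypothetical scale/zero_point values standing in for the model's real quantization parameters:

```python
import numpy as np

# hypothetical quantization parameters -- the real values come from the
# model's own quantization config (cosine is scale-invariant, so only the
# zero point actually matters here)
scale, zero_point = 0.02, 128

def dequantize(q: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

a = dequantize(emb_a_uint8)  # emb_a_uint8/emb_b_uint8: uint8 embedding vectors
b = dequantize(emb_b_uint8)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```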
I tried yours and had a hard time getting good cosine similarity with the uint8 outputs, but the one by janni-m is working well for me. Thank you!