How to run an ONNX model in Python code?

by minhdang

Good morning, I downloaded your fp16 ONNX model to run in Python, but it won't run in onnxruntime. How do I run it?

this particular model isn't super straightforward to use.

for this model you'll have to provide initial KV cache values.

get your model inputs like normal (your input_ids and attention_mask) and do something like this:

import onnxruntime as rt
import numpy as np

# onnx_path, input_ids, and attention_mask are assumed to exist already
# (tokenize as usual to get int64 input_ids / attention_mask arrays)
sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
session = rt.InferenceSession(onnx_path, providers=["CPUExecutionProvider"], sess_options=sess_options)

# model constants for this checkpoint
NUM_LAYERS = 28
NUM_KV_HEADS = 8
HEAD_DIM = 128
DTYPE = np.float32

BATCH_SIZE = input_ids.shape[0]
sequence_length = input_ids.shape[1]
position_ids = np.tile(np.arange(sequence_length, dtype=np.int64), (BATCH_SIZE, 1))

# empty KV cache: the sequence dimension is 0 because nothing is cached yet
kv_shape = (BATCH_SIZE, NUM_KV_HEADS, 0, HEAD_DIM)
empty_kv_tensor = np.zeros(kv_shape, dtype=DTYPE)
past_key_values_inputs = {}
for i in range(NUM_LAYERS):
    past_key_values_inputs[f"past_key_values.{i}.key"] = empty_kv_tensor
    past_key_values_inputs[f"past_key_values.{i}.value"] = empty_kv_tensor

# the raw output is the last hidden state, not a pooled embedding
# (the dict merge with | needs Python 3.9+)
embeddings = session.run(
    ["last_hidden_state"],
    {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    } | past_key_values_inputs,
)[0]

and then do your last-token pooling, normalization, cosine comparison, etc.
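here's a minimal sketch of that step, assuming right-padded batches (with left padding the last valid token is simply index -1) and the embeddings/attention_mask from the code above:

import numpy as np

def last_token_pool(hidden, attention_mask):
    # take the hidden state of the last non-padding token in each sequence
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]

pooled = last_token_pool(embeddings, attention_mask)
# L2-normalize so a plain dot product gives cosine similarity
pooled = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
scores = pooled @ pooled.T  # pairwise cosine similarities within the batch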

I'm trying to get this running in Elixir with Ortex and the Rust bindings for ort, and I'm not finding a way to add these KV cache keys. Is it possible to have an ONNX model that doesn't use them, or is that crazy hard?

it's a little bit tricky creating the initial KV values, since you have to match their dimensions to your inputs if you're doing any batching.

use optimum-cli on the safetensors model with a command line like this, using the feature-extraction or sentence-similarity task, and you'll get a simpler model than this one, with only input_ids/attention_mask as inputs:

optimum-cli export onnx -m <model path> <output path> --task feature-extraction
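the exported model is then a plain feature-extraction graph. a minimal sketch of running it, assuming the export wrote model.onnx into your output path (hypothetical path below):

import onnxruntime as rt

session = rt.InferenceSession("onnx_out/model.onnx", providers=["CPUExecutionProvider"])
hidden = session.run(
    ["last_hidden_state"],
    {"input_ids": input_ids, "attention_mask": attention_mask},
)[0]
# same last-token pooling / normalization as above applies here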

if you want uint8 output and an easier-to-use model, you can check out mine: https://huggingface.co/electroglyph/Qwen3-Embedding-0.6B-onnx-uint8

I tried yours and had a hard time getting good cosine similarity with the uint8 outputs, but the one by janni-m is working well for me. Thank you!

@tpendragon can you send working code please?
