How to run an ONNX model in Python code?

by minhdang

Good morning, I downloaded your fp16 ONNX model to run in Python, but it won't run in onnxruntime. How do I run it?

this particular model isn't super straightforward to use.

for this model you'll have to provide initial KV cache values.

get your model inputs like normal (your input_ids and attention_mask) and do something like this:

import onnxruntime as rt
import numpy as np

# onnx_path, input_ids, and attention_mask are assumed to exist already
# (tokenize as usual to get int64 input_ids / attention_mask arrays)
sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
session = rt.InferenceSession(onnx_path, providers=["CPUExecutionProvider"], sess_options=sess_options)

# model constants for this checkpoint
NUM_LAYERS = 28
NUM_KV_HEADS = 8
HEAD_DIM = 128
DTYPE = np.float32

BATCH_SIZE = input_ids.shape[0]
sequence_length = input_ids.shape[1]
position_ids = np.tile(np.arange(sequence_length, dtype=np.int64), (BATCH_SIZE, 1))

# empty KV cache: the sequence dimension is 0 because nothing is cached yet
kv_shape = (BATCH_SIZE, NUM_KV_HEADS, 0, HEAD_DIM)
empty_kv_tensor = np.zeros(kv_shape, dtype=DTYPE)
past_key_values_inputs = {}
for i in range(NUM_LAYERS):
    past_key_values_inputs[f"past_key_values.{i}.key"] = empty_kv_tensor
    past_key_values_inputs[f"past_key_values.{i}.value"] = empty_kv_tensor

# the raw output is the last hidden state, not a pooled embedding
# (the dict merge with | needs Python 3.9+)
embeddings = session.run(
    ["last_hidden_state"],
    {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    } | past_key_values_inputs,
)[0]

and then do your last-token pooling, normalization, cosine comparison, etc.
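here's a minimal sketch of that step, assuming right-padded batches (with left padding the last valid token is simply index -1) and the embeddings/attention_mask from the code above:

import numpy as np

def last_token_pool(hidden, attention_mask):
    # take the hidden state of the last non-padding token in each sequence
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]

pooled = last_token_pool(embeddings, attention_mask)
# L2-normalize so a plain dot product gives cosine similarity
pooled = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
scores = pooled @ pooled.T  # pairwise cosine similarities within the batch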

I'm trying to get this running in Elixir with Ortex and the Rust bindings for ort, and I'm not finding a way to add these KV cache keys. Is it possible to have an ONNX model that doesn't use them, or is that crazy hard?

it's a little bit tricky creating the initial KV values, since you have to match their dimensions to your inputs if you're doing any batching.

use optimum-cli on the safetensors model with a command line like this, using the feature-extraction or sentence-similarity task, and you'll get a simpler model than this one, with only input_ids/attention_mask as inputs:

optimum-cli export onnx -m <model path> <output path> --task feature-extraction
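the exported model is then a plain feature-extraction graph. a minimal sketch of running it, assuming the export wrote model.onnx into your output path (hypothetical path below):

import onnxruntime as rt

session = rt.InferenceSession("onnx_out/model.onnx", providers=["CPUExecutionProvider"])
hidden = session.run(
    ["last_hidden_state"],
    {"input_ids": input_ids, "attention_mask": attention_mask},
)[0]
# same last-token pooling / normalization as above applies here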

if you want uint8 output and an easier-to-use model, you can check out mine: https://huggingface.co/electroglyph/Qwen3-Embedding-0.6B-onnx-uint8

I tried yours and had a hard time getting good cosine similarity with the uint8 outputs, but the one by janni-m is working well for me. Thank you!

@tpendragon can you send working code please?
