Attempting to run stable-audio-open-small with onnxruntime in Swift/iOS.
This is a mess. The models run successfully in Python when validating them; I haven't gotten the iPhone to stop crashing yet.
When using the fp16_tools version, the diffusion component crashes the iPhone on step 0.
When using the initial version, the decoder (autoencoder_arm.onnx) crashes the iPhone.
Nothing to see here yet... just wanted a place to store these.
Like everything else I do: pure vibes, zero real knowledge.
Here's a Python script I used to validate outputs against the original PyTorch model.
There's another one using CFG that gets essentially the same outputs.
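For reference, the CFG variant differs from the script below only in how the velocity is formed each step: the DiT runs twice (conditional and unconditional prompt) and the two predictions are blended. A minimal numpy sketch of just that blend; the helper name `apply_cfg` and the guidance scale are my own placeholders, not something from the exported models:

```python
import numpy as np

def apply_cfg(v_cond, v_uncond, scale):
    """Classifier-free guidance: push the conditional prediction
    away from the unconditional one by `scale`."""
    return v_uncond + scale * (v_cond - v_uncond)

# Shapes match the latent used in the script below: (1, 64, 256).
v_c = np.ones((1, 64, 256), dtype=np.float32)
v_u = np.zeros((1, 64, 256), dtype=np.float32)
v = apply_cfg(v_c, v_u, 7.0)
```

In the real loop you'd call `dit.run` twice per step (or batch the two passes) and feed the result to the same Euler update.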
```python
#!/usr/bin/env python
import numpy as np, soundfile as sf, onnxruntime as ort
from transformers import AutoTokenizer

# Load ONNX models
dit = ort.InferenceSession("diffusion_dit_arm.onnx")
cond = ort.InferenceSession("conditioners.onnx")
dec = ort.InferenceSession("autoencoder_arm.onnx")

# Config
prompt = "lo-fi hip-hop beat with pianos 90bpm"
steps = 10
rng = np.random.RandomState(12345)
x = rng.randn(1, 64, 256).astype(np.float32)

# Conditioning
tok = AutoTokenizer.from_pretrained("t5-base")
tokens = tok(prompt, truncation=True, padding="max_length", max_length=128, return_tensors="np")
conds = cond.run(None, {
    "input_ids": tokens["input_ids"].astype(np.int64),
    "attention_mask": tokens["attention_mask"].astype(np.int64),
    "seconds_total": np.array([10.0], dtype=np.float32)
})
cross, _, glob = conds

# Run 10 steps with linear t, no CFG
for i in range(steps):
    t_val = 1.0 - i / (steps - 1)
    t = np.array([t_val], dtype=np.float32)
    v = dit.run(None, {
        "x": x, "t": t,
        "cross_attn_cond": cross,
        "global_cond": glob
    })[0]
    x -= 0.1 * v  # fixed Euler step

# Decode
audio = dec.run(None, {"sampled": x})[0]
if audio.shape[0] == 2:
    audio = audio.T  # soundfile expects (frames, channels)
audio /= np.abs(audio).max()
sf.write("onnx_lofi_linear.wav", audio, 44100)
print("onnx_lofi_linear.wav written!")
```
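One note on the update in that loop: the step size is hardcoded to 0.1, while the linear t schedule is actually spaced 1/(steps-1) ≈ 0.111 apart, so the fixed step doesn't exactly match the schedule. A sketch of an Euler loop that derives dt from the schedule itself; `euler_sample` is my own helper, and a toy velocity function stands in for the DiT since no model is loaded here:

```python
import numpy as np

def euler_sample(x, velocity_fn, steps=10):
    """Euler integration from t=1 down to t=0, with dt taken
    from the schedule instead of hardcoded."""
    ts = np.linspace(1.0, 0.0, steps)  # same linear schedule as the script
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t_cur)
        x = x + (t_next - t_cur) * v  # dt is negative since t decreases
    return x

# Toy velocity v = x: each step multiplies x by (1 - 1/9),
# so after 9 intervals the result is x * (8/9)**9.
x0 = np.ones((1, 64, 256), dtype=np.float32)
out = euler_sample(x0, lambda x, t: x)
```

In the real script, `velocity_fn` would wrap the `dit.run` call with the conditioning tensors closed over.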