Problems running on CPU
Hi,
First of all, thank you to the Stability AI team for the amazing work; the speed of the new model is impressive!!!
I'm using the sample code under "Use this model" > "Stable Audio Tools" and tried running it on my MacBook Air M2. It falls back to running on CPU (as expected), but then the call to generate_diffusion_cond gets stuck indefinitely (I didn't wait for it to finish, but I let it run for several minutes before terminating the process manually).
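For reference, this is roughly what I'm running (paraphrased from the model card example; the prompt, step count, and sampler settings here are just placeholders from memory):

```python
import torch
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"  # ends up as "cpu" on my M2

# Download the model and its config
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-small")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)

conditioning = [{
    "prompt": "128 BPM tech house drum loop",  # placeholder prompt
    "seconds_total": 11,
}]

# This is the call that never returns on CPU
output = generate_diffusion_cond(
    model,
    steps=8,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="pingpong",
    device=device,
)
```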
I then tested the code on a GPU server provider, and GPU generation works as fast as expected.
I also tested on a basic hosting service with no GPU, falling back to CPU mode again (no ARM chip this time), and saw the same behaviour: generation got stuck for minutes until I terminated it.
Am I the only one experiencing this?
I can also add that with the same setup and almost exactly the same Python code, I am able to run the stable-audio-open-1.0 model without those issues, although the generation takes longer of course.
I am also facing this currently! It hangs during the model.pretransform.decode(sampled) call in generation.py
This is on Windows btw, with an Intel i9.
Yeah, the VAE decoder is certainly the biggest bottleneck for this model.
The model doesn't necessarily run well on CPUs out of the box. You can follow the guide that Arm put out to get a compiled version of the model that works better on CPU: https://learn.arm.com/learning-paths/mobile-graphics-and-gaming/run-stable-audio-open-small-with-lite-rt
We'll also be working with Arm to get this LiteRT version hosted on Hugging Face with the proper documentation to run more efficiently on CPU. As I understand it, Arm CPUs are not specifically required; LiteRT works across many different device types.
Thanks for clarifying @Fauno15! That sounds promising. I will follow the tutorial; I'm not working with Android, but perhaps the tutorial will make it clear where to take it from there and how to make use of it. If I understand correctly, LiteRT is a "packaged" version of the model that I can use within my own code/app? Thanks again!
The issue seems to be with the model being in float16, which made the conv1d layers in the decoder very slow.
Adding these two lines solved the problem for me:
model.pretransform.model_half = False
model.to(torch.float32)
Interestingly, the encoder was completely fine in float16; not sure why the decoder in particular was affected so much.
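In context it looks something like this (just a sketch of my setup; the rest of the generation code is unchanged from the example above):

```python
import torch
from stable_audio_tools import get_pretrained_model

model, model_config = get_pretrained_model("stabilityai/stable-audio-open-small")

# Keep everything, including the VAE pretransform, in float32 so the
# conv1d layers in the decoder don't crawl on CPU
model.pretransform.model_half = False
model.to(torch.float32)

model = model.to("cpu")
# ... generate_diffusion_cond(...) as usual from here ...
```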
For macOS, presuming you have Apple Silicon, you can follow these instructions to get accelerated PyTorch training on Mac (the "mps" device):
https://developer.apple.com/metal/pytorch/
Then just set your device accordingly:
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"