|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- facebook/musicgen-large |
|
base_model_relation: quantized |
|
pipeline_tag: text-to-audio |
|
language: |
|
- en |
|
tags: |
|
- text-to-audio |
|
- music-generation |
|
- pytorch |
|
- annthem |
|
- qlip |
|
- thestage |
|
--- |
|
|
|
# Elastic model: MusicGen Large. Fastest and most flexible models for self-hosting.
|
|
|
# Attention: this page is for informational purposes only. To use these models, you will need to wait for an update of the `elastic_models` package!
|
|
|
Elastic models are produced by TheStage AI ANNA, the Automated Neural Networks Accelerator. ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized versions:
|
|
|
* __XL__: Mathematically equivalent neural network (the original `facebook/musicgen-large`), compiled and optimized with our DNN compiler.

* __L__: Near-lossless model, with minimal degradation on the corresponding audio quality benchmarks.

* __M__: Faster model, with minor and acceptable accuracy degradation.

* __S__: The fastest model, with slight accuracy degradation.

* __Original__: The original `facebook/musicgen-large` model from Hugging Face, without QLIP compilation.
|
|
|
__Goals of elastic models:__ |
|
|
|
* Provide flexibility in cost-vs-quality selection for inference.

* Provide clear quality and latency benchmarks for audio generation.

* Provide the familiar HF `transformers` interface: `elastic_models` requires only a single line of code change to use the optimized versions (see the sketch after this list).

* Provide models supported on a wide range of hardware (NVIDIA GPUs), pre-compiled and requiring no JIT.

* Provide the best models and service for self-hosting.
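The single line of code change mentioned above is the import (a minimal sketch; the full example appears in the Inference section below):

```python
# Standard Hugging Face import:
# from transformers import MusicgenForConditionalGeneration

# elastic_models drop-in replacement, the only line that changes:
from elastic_models.transformers import MusicgenForConditionalGeneration
```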
|
|
|
> It's important to note that the exact quality degradation can vary. We aim for S models to retain high perceptual quality. "Original" in the tables refers to the non-compiled Hugging Face model, while "XL" is the compiled original; S, M, and L are ANNA-quantized and compiled.
|
|
|
|
|
 |
|
|
|
## Audio Examples |
|
|
|
Below are a few examples demonstrating the audio quality of the different Elastic MusicGen Large versions. These samples were generated on an NVIDIA H100 GPU, each 20 seconds long. For a more comprehensive set of examples and interactive demos, please visit [music.thestage.ai](http://music.thestage.ai).
|
|
|
**Prompt:** "Calm lofi hip hop track with a simple piano melody and soft drums" (Audio: 20 seconds, H100 GPU) |
|
|
|
|
|
| S | M | L | XL (Compiled Original) | Original (HF Non-Compiled) | |
|
|----------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------| |
|
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/S82_oagiYy2r00ZYpBJ3Q.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/n7RWM2q3YHUE0oA-oiISy.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/LBnfVjM2jNEqndVhBnXok.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/TYINxt_EcH-60oHMnO-B0.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/IKxeZ2LVYNsrjeNE9B7vS.mpga"></audio> | |
|
|
|
----- |
|
|
|
|
|
|
|
## Inference |
|
|
|
To run inference with our MusicGen models, you primarily use the `elastic_models.transformers.MusicgenForConditionalGeneration` class. If you have compiled engines, provide the path to them; otherwise, for non-compiled or original models, you can use the standard Hugging Face `transformers.MusicgenForConditionalGeneration` (a comparison sketch follows the example below).
|
|
|
|
|
**Example using `elastic_models` with a compiled model:** |
|
|
|
```python
import torch
import scipy.io.wavfile

from transformers import AutoProcessor
from elastic_models.transformers import MusicgenForConditionalGeneration

model_name_hf = "facebook/musicgen-large"
elastic_mode = "S"  # one of "S", "M", "L", "XL"

prompt = "A groovy funk bassline with a tight drum beat"
output_wav_path = "generated_audio_elastic_S.wav"
hf_token = "YOUR_TOKEN"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The processor tokenizes the text prompt for the model.
processor = AutoProcessor.from_pretrained(model_name_hf, token=hf_token)

# `mode` selects the elastic variant (S, M, L, or XL).
model = MusicgenForConditionalGeneration.from_pretrained(
    model_name_hf,
    token=hf_token,
    torch_dtype=torch.float16,
    mode=elastic_mode,
    device=device,
).to(device)
model.eval()

inputs = processor(
    text=[prompt],
    padding=True,
    return_tensors="pt",
).to(device)

print(f"Generating audio for: {prompt}...")
# 256 new tokens correspond to roughly 5 seconds of audio.
generate_kwargs = {
    "do_sample": True,
    "guidance_scale": 3.0,
    "max_new_tokens": 256,
    "cache_implementation": "paged",
}

audio_values = model.generate(**inputs, **generate_kwargs)
audio_values_np = audio_values.to(torch.float32).cpu().numpy().squeeze()

# EnCodec sampling rate (32 kHz for MusicGen Large).
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write(output_wav_path, rate=sampling_rate, data=audio_values_np)
print(f"Audio saved to {output_wav_path}")
```
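For comparison, here is a minimal sketch of the same generation with the stock Hugging Face class (no `elastic_models`, no compiled engines), using the standard `transformers` MusicGen API:

```python
import torch
import scipy.io.wavfile

from transformers import AutoProcessor, MusicgenForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large").to(device)

inputs = processor(
    text=["A groovy funk bassline with a tight drum beat"],
    padding=True,
    return_tensors="pt",
).to(device)

# Same sampling settings as above, without the paged cache setting.
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write(
    "generated_audio_original.wav",
    rate=sampling_rate,
    data=audio_values.cpu().float().numpy().squeeze(),
)
```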
|
|
|
__System requirements:__ |
|
* GPUs: NVIDIA H100, L40S. |
|
* CPU: AMD, Intel |
|
* Python: 3.8-3.11 (check dependencies for specific versions) |
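Before installing, you can sanity-check your environment against these requirements with a short script (a sketch; assumes PyTorch with CUDA support is already installed):

```python
import sys
import torch

# Matches the supported Python range listed above.
assert (3, 8) <= sys.version_info[:2] <= (3, 11), "Python 3.8-3.11 expected"

# An NVIDIA GPU (e.g. H100 or L40S) is required for the compiled engines.
assert torch.cuda.is_available(), "CUDA-capable GPU not found"
print(torch.cuda.get_device_name(0))
```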
|
|
|
To work with our elastic models and compilation tools, you'll need to install the `elastic_models` and `qlip` libraries from TheStage:
|
|
|
```shell
pip install thestage
pip install "elastic_models[nvidia]" \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple

pip install flash-attn==2.7.3 --no-build-isolation
pip uninstall apex
```
|
|
|
Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set up the API token as follows:
|
|
|
```shell |
|
thestage config set --api-token <YOUR_API_TOKEN> |
|
``` |
|
|
|
Congrats, now you can use accelerated models and tools! |
|
|
|
---- |
|
|
|
## Benchmarks |
|
|
|
Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for MusicGen models accelerated with our algorithms.
The `Original` column in the latency benchmarks refers to the non-compiled Hugging Face `facebook/musicgen-large` model; `XL` denotes the same model compiled with our DNN compiler but without ANNA quantization.
|
|
|
### Latency benchmarks (Tokens Per Second - TPS) |
|
|
|
Performance of the audio generation decoder stage with `max_new_tokens = 256` (about 5 seconds of audio).
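As a rough way to read these numbers: MusicGen's decoder emits about 50 tokens per second of audio (hence 256 tokens ≈ 5 seconds above), so TPS divided by 50 gives an approximate real-time factor. A small illustration, using the batch-1 H100 figures below (assumed to be per-stream throughput):

```python
FRAME_RATE = 50  # MusicGen decoder tokens per second of generated audio

def realtime_factor(tokens_per_second: float) -> float:
    """Seconds of audio generated per wall-clock second."""
    return tokens_per_second / FRAME_RATE

print(f"S on H100:        {realtime_factor(130.52):.1f}x real time")  # ~2.6x
print(f"Original on H100: {realtime_factor(44.80):.1f}x real time")   # ~0.9x
```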
|
|
|
|
|
**Batch Size 1:** |
|
|
|
| GPU Type | S | M | L | XL | Original | |
|
|--------|---|---|---|----|----| |
|
| H100 | 130.52 | 129.87 | 128.57 | 129.25 | 44.80 | |
|
| L40S | 101.70 | 95.65 | 89.99 | 83.39 | 44.43 | |
|
|
|
**Batch Size 16:** |
|
|
|
| GPU Type | S | M | L | XL | Original | |
|
|--------|---|---|---|----|----| |
|
| H100 | 106.06 | 105.82 | 107.07 | 106.55 | 41.09 | |
|
| L40S | 74.97 | 71.52 | 68.09 | 63.86 | 36.40 | |
|
|
|
**Batch Size 32:** |
|
|
|
| GPU Type | S | M | L | XL | Original | |
|
|--------|---|---|---|----|----| |
|
| H100 | 83.58 | 84.13 | 84.04 | 83.90 | 34.50 | |
|
| L40S | 57.36 | 55.60 | 53.73 | 51.33 | 28.72 | |
|
|
|
|
|
## Links |
|
|
|
* __Platform__: [app.thestage.ai](https://app.thestage.ai) |
|
* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI) |
|
* __Contact email__: [email protected] |
|
|