---
license: apache-2.0
base_model:
- facebook/musicgen-large
base_model_relation: quantized
pipeline_tag: text-to-audio
language:
- en
tags:
- text-to-audio
- music-generation
- pytorch
- annthem
- qlip
- thestage
---

# Elastic model: MusicGen Large. Fastest and most flexible models for self-serving.

# Attention: this page is for informational purposes only. To use the models, you need to wait for an update of the `elastic_models` package!

Elastic models are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:

* __XL__: Mathematically equivalent neural network (the original `facebook/musicgen-large`, compiled), optimized with our DNN compiler.
* __L__: Near-lossless model, with minimal degradation on the corresponding audio quality benchmarks.
* __M__: Faster model, with minor and acceptable accuracy degradation.
* __S__: The fastest model, with slight accuracy degradation, offering the best speed.
* __Original__: The original `facebook/musicgen-large` model from Hugging Face, without QLIP compilation.

__Goals of elastic models:__

* Provide flexibility in cost vs quality selection for inference
* Provide clear quality and latency benchmarks for audio generation
* Provide the interface of HF libraries (`transformers` and `elastic_models`) with a single line of code change for using optimized versions
* Provide models supported on a wide range of hardware (NVIDIA GPUs), which are pre-compiled and require no JIT
* Provide the best models and service for self-hosting

> It's important to note that specific quality degradation can vary. We aim for S models to retain high perceptual quality. "Original" in the tables refers to the non-compiled Hugging Face model, while "XL" is the compiled original. S, M and L are ANNA-quantized and compiled.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/7kuTModQp4_5lRqR5QJ5P.png)

## Audio Examples

Below are a few examples demonstrating the audio quality of the different Elastic MusicGen Large versions. These samples were generated on an NVIDIA H100 GPU with a duration of 20 seconds each. For a more comprehensive set of examples and interactive demos, please visit [musicgen.thestage.ai](http://music.thestage.ai).

**Prompt:** "Calm lofi hip hop track with a simple piano melody and soft drums" (Audio: 20 seconds, H100 GPU)

| S | M | L | XL (Compiled Original) | Original (HF Non-Compiled) |
|---|---|---|------------------------|----------------------------|
|   |   |   |                        |                            |

-----

## Inference

To infer our MusicGen models, you primarily use the `elastic_models.transformers.MusicgenForConditionalGeneration` class. If you have compiled engines, you provide the path to them. Otherwise, for non-compiled or original models, you can use the standard Hugging Face `transformers.MusicgenForConditionalGeneration`, as in the sketch below.
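For reference, here is a minimal sketch of that baseline path through stock `transformers` (standard Hugging Face API only, no `elastic_models`-specific arguments such as `mode`); the prompt and output filename are illustrative:

```python
# Baseline: original facebook/musicgen-large via stock Hugging Face transformers
# (no compiled engines, no ANNA quantization).
import torch
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large").to(device)
model.eval()

inputs = processor(
    text=["A groovy funk bassline with a tight drum beat"],  # illustrative prompt
    padding=True,
    return_tensors="pt",
).to(device)

# Same sampling settings as the elastic example below, minus the paged cache.
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write(
    "generated_audio_original.wav",
    rate=sampling_rate,
    data=audio_values[0, 0].cpu().numpy(),  # (batch, channels, samples) -> mono waveform
)
```

Moving to the optimized versions is then the import swap plus the `mode` argument shown in the example below.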
**Example using `elastic_models` with a compiled model:**

```python
import torch
import scipy.io.wavfile
from transformers import AutoProcessor
from elastic_models.transformers import MusicgenForConditionalGeneration

model_name_hf = "facebook/musicgen-large"
elastic_mode = "S"  # one of "S", "M", "L", "XL"
prompt = "A groovy funk bassline with a tight drum beat"
output_wav_path = "generated_audio_elastic_S.wav"
hf_token = "YOUR_TOKEN"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

processor = AutoProcessor.from_pretrained(model_name_hf, token=hf_token)
model = MusicgenForConditionalGeneration.from_pretrained(
    model_name_hf,
    token=hf_token,
    torch_dtype=torch.float16,
    mode=elastic_mode,
    device=device,
).to(device)
model.eval()

inputs = processor(
    text=[prompt],
    padding=True,
    return_tensors="pt",
).to(device)

print(f"Generating audio for: {prompt}...")
generate_kwargs = {
    "do_sample": True,
    "guidance_scale": 3.0,
    "max_new_tokens": 256,
    "cache_implementation": "paged",
}
audio_values = model.generate(**inputs, **generate_kwargs)

# Convert to float32 numpy and write a WAV file at the model's native sampling rate.
audio_values_np = audio_values.to(torch.float32).cpu().numpy().squeeze()
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write(output_wav_path, rate=sampling_rate, data=audio_values_np)
print(f"Audio saved to {output_wav_path}")
```

__System requirements:__

* GPUs: NVIDIA H100, L40S
* CPU: AMD, Intel
* Python: 3.8-3.11 (check dependencies for specific versions)

To work with our elastic models and compilation tools, you'll need to install the `elastic_models` and `qlip` libraries from TheStage:

```shell
pip install thestage
pip install elastic_models[nvidia] \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple

pip install flash-attn==2.7.3 --no-build-isolation
pip uninstall apex
```

Then go to [app.thestage.ai](https://app.thestage.ai), log in and generate an API token from your profile page. Set up the API token as follows:

```shell
thestage config set --api-token
```

Congrats, now you can use accelerated models and tools!

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for MusicGen models using our algorithms. In the latency tables below, the `Original` column refers to the non-compiled Hugging Face `facebook/musicgen-large` model, while `XL` is the same model compiled without ANNA quantization.

### Latency benchmarks (Tokens Per Second - TPS)

Performance for generating audio (decoder stage, max_new_tokens = 256, i.e. about 5 seconds of audio).

**Batch Size 1:**

| GPU Type | S | M | L | XL | Original |
|----------|--------|--------|--------|--------|----------|
| H100     | 130.52 | 129.87 | 128.57 | 129.25 | 44.80    |
| L40S     | 101.70 | 95.65  | 89.99  | 83.39  | 44.43    |

**Batch Size 16:**

| GPU Type | S | M | L | XL | Original |
|----------|--------|--------|--------|--------|----------|
| H100     | 106.06 | 105.82 | 107.07 | 106.55 | 41.09    |
| L40S     | 74.97  | 71.52  | 68.09  | 63.86  | 36.40    |

**Batch Size 32:**

| GPU Type | S | M | L | XL | Original |
|----------|--------|--------|--------|--------|----------|
| H100     | 83.58  | 84.13  | 84.04  | 83.90  | 34.50    |
| L40S     | 57.36  | 55.60  | 53.73  | 51.33  | 28.72    |

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai