---
license: apache-2.0
base_model:
- facebook/musicgen-large
base_model_relation: quantized
pipeline_tag: text-to-audio
language:
- en
tags:
- text-to-audio
- music-generation
- pytorch
- annthem
- qlip
- thestage
---

# Elastic model: MusicGen Large. Fastest and most flexible models for self-hosting.

# Attention: this page is for informational purposes only; to use the models, you will need to wait for an update of the `elastic_models` package!

Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized variants:

* __XL__: Mathematically equivalent neural network (the original `facebook/musicgen-large`), compiled and optimized with our DNN compiler.
* __L__: Near-lossless model, with minimal degradation on the corresponding audio quality benchmarks.
* __M__: Faster model, with minor and acceptable accuracy degradation.
* __S__: The fastest model, with slight accuracy degradation.
* __Original__: The original `facebook/musicgen-large` model from Hugging Face, without QLIP compilation.

__Goals of elastic models:__

* Provide flexibility in the cost vs. quality trade-off for inference
* Provide clear quality and latency benchmarks for audio generation
* Provide the interface of the HF libraries `transformers` and `elastic_models`, so optimized versions can be used with a single line of code change (see the sketch after this list)
* Provide models supported on a wide range of hardware (NVIDIA GPUs), pre-compiled and requiring no JIT
* Provide the best models and service for self-hosting
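
A minimal sketch of that one-line change; the class name and `mode` argument match the full example in the Inference section below:

```python
# Baseline Hugging Face import:
# from transformers import MusicgenForConditionalGeneration

# Optimized elastic version -- the only line that changes:
from elastic_models.transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-large",
    mode="S",  # one of "S", "M", "L", "XL"
)
```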

> It's important to note that specific quality degradation can vary. We aim for S models to retain high perceptual quality. The "Original" in the tables refers to the non-compiled Hugging Face model, while "XL" is the compiled original. S, M, and L are ANNA-quantized and compiled.


![image/png](https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/7kuTModQp4_5lRqR5QJ5P.png)

## Audio Examples

Below are a few examples demonstrating the audio quality of the different Elastic MusicGen Large versions. Each sample was generated on an NVIDIA H100 GPU and is 20 seconds long. For a more comprehensive set of examples and interactive demos, please visit [musicgen.thestage.ai](http://music.thestage.ai).

**Prompt:** "Calm lofi hip hop track with a simple piano melody and soft drums" (Audio: 20 seconds, H100 GPU)


| S                                                                                                              | M                                                                                                              | L                                                                                                              | XL (Compiled Original)                                                                                           | Original (HF Non-Compiled)                                                                                       |
|----------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/S82_oagiYy2r00ZYpBJ3Q.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/n7RWM2q3YHUE0oA-oiISy.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/LBnfVjM2jNEqndVhBnXok.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/TYINxt_EcH-60oHMnO-B0.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/IKxeZ2LVYNsrjeNE9B7vS.mpga"></audio> |

-----



## Inference

To run inference with our MusicGen models, use the `elastic_models.transformers.MusicgenForConditionalGeneration` class. If you have compiled engines, provide the path to them; otherwise, for non-compiled or original models, you can use the standard Hugging Face `transformers.MusicgenForConditionalGeneration`.


**Example using `elastic_models` with a compiled model:**

```python
import torch
import scipy.io.wavfile

from transformers import AutoProcessor
from elastic_models.transformers import MusicgenForConditionalGeneration

model_name_hf = "facebook/musicgen-large"
elastic_mode = "S"  # one of "S", "M", "L", "XL"

prompt = "A groovy funk bassline with a tight drum beat"
output_wav_path = "generated_audio_elastic_S.wav"
hf_token = "YOUR_TOKEN"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The processor tokenizes the text prompt for the model.
processor = AutoProcessor.from_pretrained(model_name_hf, token=hf_token)

# `mode` selects the elastic variant; the pre-compiled engine for this GPU
# is loaded automatically, so no JIT compilation happens at runtime.
model = MusicgenForConditionalGeneration.from_pretrained(
    model_name_hf,
    token=hf_token,
    torch_dtype=torch.float16,
    mode=elastic_mode,
    device=device,
).to(device)
model.eval()

inputs = processor(
    text=[prompt],
    padding=True,
    return_tensors="pt",
).to(device)

print(f"Generating audio for: {prompt}...")
# 256 new tokens correspond to roughly 5 seconds of audio.
generate_kwargs = {
    "do_sample": True,
    "guidance_scale": 3.0,
    "max_new_tokens": 256,
    "cache_implementation": "paged",
}

audio_values = model.generate(**inputs, **generate_kwargs)
audio_values_np = audio_values.to(torch.float32).cpu().numpy().squeeze()

# The audio encoder's sampling rate is stored in the model config.
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write(output_wav_path, rate=sampling_rate, data=audio_values_np)
print(f"Audio saved to {output_wav_path}")
```
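
For comparison, here is a minimal sketch of the non-compiled baseline using the standard Hugging Face class; the processor, prompt, and generation arguments stay the same:

```python
import torch
import scipy.io.wavfile

from transformers import AutoProcessor, MusicgenForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-large",
    torch_dtype=torch.float16,
).to(device)

inputs = processor(
    text=["A groovy funk bassline with a tight drum beat"],
    padding=True,
    return_tensors="pt",
).to(device)

audio_values = model.generate(
    **inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=256
)
audio_np = audio_values.to(torch.float32).cpu().numpy().squeeze()
scipy.io.wavfile.write(
    "generated_audio_original.wav",
    rate=model.config.audio_encoder.sampling_rate,
    data=audio_np,
)
```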

__System requirements:__
* GPUs: NVIDIA H100, L40S.
* CPU: AMD, Intel
* Python: 3.8-3.11 (check dependencies for specific versions)

To work with our elastic models and compilation tools, you'll need to install the `elastic_models` and `qlip` libraries from TheStage:

```shell
pip install thestage
pip install "elastic_models[nvidia]" \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple

pip install flash-attn==2.7.3 --no-build-isolation
pip uninstall apex
```

Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models and tools!
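
As a quick sanity check that the installation succeeded, the following minimal sketch only verifies that the package imports cleanly and that a GPU is visible:

```python
import torch
from elastic_models.transformers import MusicgenForConditionalGeneration  # noqa: F401

print("CUDA available:", torch.cuda.is_available())
```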

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for MusicGen models accelerated with our algorithms.
In the latency tables below, the `Original` column refers to the non-compiled Hugging Face `facebook/musicgen-large` model, while `XL` is the compiled original without ANNA quantization.

### Latency benchmarks (Tokens Per Second - TPS)

Decoder-stage throughput when generating audio with `max_new_tokens = 256` (about 5 seconds of audio).


**Batch Size 1:**

| GPU Type | S | M | L | XL | Original |
|--------|---|---|---|----|----|
| H100 | 130.52 | 129.87 | 128.57 | 129.25 | 44.80 |
| L40S | 101.70 | 95.65 | 89.99 | 83.39 | 44.43 |

**Batch Size 16:**

| GPU Type | S | M | L | XL | Original |
|--------|---|---|---|----|----|
| H100 | 106.06 | 105.82 | 107.07 | 106.55 | 41.09 |
| L40S | 74.97 | 71.52 | 68.09 | 63.86 | 36.40 |

**Batch Size 32:**

| GPU Type | S | M | L | XL | Original |
|--------|---|---|---|----|----|
| H100 | 83.58 | 84.13 | 84.04 | 83.90 | 34.50 |
| L40S | 57.36 | 55.60 | 53.73 | 51.33 | 28.72 |
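
To translate these throughput numbers into wall-clock time, divide the token budget by the TPS value. A small sketch using the batch-size-1 H100 row:

```python
# Rough per-clip generation time derived from the TPS tables above
# (batch size 1, H100, max_new_tokens = 256, i.e. ~5 seconds of audio).
max_new_tokens = 256

tps = {"S": 130.52, "XL": 129.25, "Original": 44.80}
for name, value in tps.items():
    print(f"{name:>8}: ~{max_new_tokens / value:.2f} s per clip")
# S and XL finish in about 2 s; the non-compiled Original takes about 5.7 s.
```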


## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI)
* __Contact email__: [email protected]