|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- facebook/musicgen-large |
|
base_model_relation: quantized |
|
pipeline_tag: text-to-audio |
|
language: |
|
- en |
|
tags: |
|
- text-to-audio |
|
- music-generation |
|
- pytorch |
|
- annthem |
|
- qlip |
|
- thestage |
|
--- |
|
|
|
# Elastic model: MusicGen Large. Fastest and most flexible models for self-hosting.
|
|
|
# Attention: this page is for informational purposes only. To use these models, you will need to wait for an update of the `elastic_models` package!
|
|
|
Elastic models are produced by TheStage AI ANNA, the Automated Neural Networks Accelerator. ANNA lets you control model size, latency, and quality with a simple slider movement. For each model, ANNA produces a series of optimized versions:
|
|
|
* __XL__: Mathematically equivalent neural network (the original `facebook/musicgen-large`), compiled and optimized with our DNN compiler.

* __L__: Near-lossless model, with minimal degradation on the corresponding audio quality benchmarks.

* __M__: Faster model, with minor and acceptable accuracy degradation.

* __S__: The fastest model, with slight accuracy degradation.

* __Original__: The original `facebook/musicgen-large` model from Hugging Face, without QLIP compilation.
|
|
|
__Goals of elastic models:__ |
|
|
|
* Provide flexibility in cost-vs-quality selection for inference.

* Provide clear quality and latency benchmarks for audio generation.

* Provide the familiar HF `transformers` interface: `elastic_models` requires only a single line of code change to use the optimized versions (see the sketch after this list).

* Provide models supported on a wide range of hardware (NVIDIA GPUs), pre-compiled and requiring no JIT.

* Provide the best models and service for self-hosting.
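The single line of code change mentioned above is the import (a minimal sketch; the full example appears in the Inference section below):

```python
# Standard Hugging Face import:
# from transformers import MusicgenForConditionalGeneration

# elastic_models drop-in replacement, the only line that changes:
from elastic_models.transformers import MusicgenForConditionalGeneration
```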
|
|
|
> It's important to note that the exact quality degradation can vary. We aim for S models to retain high perceptual quality. "Original" in the tables refers to the non-compiled Hugging Face model, while "XL" is the compiled original; S, M, and L are ANNA-quantized and compiled.
|
|
|
|
|
 |
|
|
|
## Audio Examples |
|
|
|
Below are a few examples demonstrating the audio quality of the different Elastic MusicGen Large versions. These samples were generated on an NVIDIA H100 GPU, each 20 seconds long. For a more comprehensive set of examples and interactive demos, please visit [music.thestage.ai](http://music.thestage.ai).
|
|
|
**Prompt:** "Calm lofi hip hop track with a simple piano melody and soft drums" (Audio: 20 seconds, H100 GPU) |
|
|
|
|
|
| S | M | L | XL (Compiled Original) | Original (HF Non-Compiled) | |
|
|----------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------| |
|
| <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/S82_oagiYy2r00ZYpBJ3Q.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/n7RWM2q3YHUE0oA-oiISy.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/LBnfVjM2jNEqndVhBnXok.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/TYINxt_EcH-60oHMnO-B0.mpga"></audio> | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/6799fc8e150f5a4014b030ca/IKxeZ2LVYNsrjeNE9B7vS.mpga"></audio> | |
|
|
|
----- |
|
|
|
|
|
|
|
## Inference |
|
|
|
To run inference with our MusicGen models, you primarily use the `elastic_models.transformers.MusicgenForConditionalGeneration` class. If you have compiled engines, provide the path to them; otherwise, for non-compiled or original models, you can use the standard Hugging Face `transformers.MusicgenForConditionalGeneration` (a comparison sketch follows the example below).
|
|
|
|
|
**Example using `elastic_models` with a compiled model:** |
|
|
|
```python
import torch
import scipy.io.wavfile

from transformers import AutoProcessor
from elastic_models.transformers import MusicgenForConditionalGeneration

model_name_hf = "facebook/musicgen-large"
elastic_mode = "S"  # one of "S", "M", "L", "XL"

prompt = "A groovy funk bassline with a tight drum beat"
output_wav_path = "generated_audio_elastic_S.wav"
hf_token = "YOUR_TOKEN"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The processor tokenizes the text prompt for the model.
processor = AutoProcessor.from_pretrained(model_name_hf, token=hf_token)

# `mode` selects the elastic variant (S, M, L, or XL).
model = MusicgenForConditionalGeneration.from_pretrained(
    model_name_hf,
    token=hf_token,
    torch_dtype=torch.float16,
    mode=elastic_mode,
    device=device,
).to(device)
model.eval()

inputs = processor(
    text=[prompt],
    padding=True,
    return_tensors="pt",
).to(device)

print(f"Generating audio for: {prompt}...")
# 256 new tokens correspond to roughly 5 seconds of audio.
generate_kwargs = {
    "do_sample": True,
    "guidance_scale": 3.0,
    "max_new_tokens": 256,
    "cache_implementation": "paged",
}

audio_values = model.generate(**inputs, **generate_kwargs)
audio_values_np = audio_values.to(torch.float32).cpu().numpy().squeeze()

# EnCodec sampling rate (32 kHz for MusicGen Large).
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write(output_wav_path, rate=sampling_rate, data=audio_values_np)
print(f"Audio saved to {output_wav_path}")
```
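For comparison, here is a minimal sketch of the same generation with the stock Hugging Face class (no `elastic_models`, no compiled engines), using the standard `transformers` MusicGen API:

```python
import torch
import scipy.io.wavfile

from transformers import AutoProcessor, MusicgenForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large").to(device)

inputs = processor(
    text=["A groovy funk bassline with a tight drum beat"],
    padding=True,
    return_tensors="pt",
).to(device)

# Same sampling settings as above, without the paged cache setting.
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write(
    "generated_audio_original.wav",
    rate=sampling_rate,
    data=audio_values.cpu().float().numpy().squeeze(),
)
```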
|
|
|
__System requirements:__ |
|
* GPUs: NVIDIA H100, L40S. |
|
* CPU: AMD, Intel |
|
* Python: 3.8-3.11 (check dependencies for specific versions) |
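Before installing, you can sanity-check your environment against these requirements with a short script (a sketch; assumes PyTorch with CUDA support is already installed):

```python
import sys
import torch

# Matches the supported Python range listed above.
assert (3, 8) <= sys.version_info[:2] <= (3, 11), "Python 3.8-3.11 expected"

# An NVIDIA GPU (e.g. H100 or L40S) is required for the compiled engines.
assert torch.cuda.is_available(), "CUDA-capable GPU not found"
print(torch.cuda.get_device_name(0))
```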
|
|
|
To work with our elastic models and compilation tools, you'll need to install the `elastic_models` and `qlip` libraries from TheStage:
|
|
|
```shell
pip install thestage
pip install "elastic_models[nvidia]" \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple

pip install flash-attn==2.7.3 --no-build-isolation
pip uninstall apex
```
|
|
|
Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set up the API token as follows:
|
|
|
```shell |
|
thestage config set --api-token <YOUR_API_TOKEN> |
|
``` |
|
|
|
Congrats, now you can use accelerated models and tools! |
|
|
|
---- |
|
|
|
## Benchmarks |
|
|
|
Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for MusicGen models accelerated with our algorithms.
The `Original` column in the latency benchmarks refers to the non-compiled Hugging Face `facebook/musicgen-large` model; `XL` denotes the same model compiled with our DNN compiler but without ANNA quantization.
|
|
|
### Latency benchmarks (Tokens Per Second - TPS) |
|
|
|
Performance of the audio generation decoder stage with `max_new_tokens = 256` (about 5 seconds of audio).
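As a rough way to read these numbers: MusicGen's decoder emits about 50 tokens per second of audio (hence 256 tokens ≈ 5 seconds above), so TPS divided by 50 gives an approximate real-time factor. A small illustration, using the batch-1 H100 figures below (assumed to be per-stream throughput):

```python
FRAME_RATE = 50  # MusicGen decoder tokens per second of generated audio

def realtime_factor(tokens_per_second: float) -> float:
    """Seconds of audio generated per wall-clock second."""
    return tokens_per_second / FRAME_RATE

print(f"S on H100:        {realtime_factor(130.52):.1f}x real time")  # ~2.6x
print(f"Original on H100: {realtime_factor(44.80):.1f}x real time")   # ~0.9x
```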
|
|
|
|
|
**Batch Size 1:** |
|
|
|
| GPU Type | S | M | L | XL | Original | |
|
|--------|---|---|---|----|----| |
|
| H100 | 130.52 | 129.87 | 128.57 | 129.25 | 44.80 | |
|
| L40S | 101.70 | 95.65 | 89.99 | 83.39 | 44.43 | |
|
|
|
**Batch Size 16:** |
|
|
|
| GPU Type | S | M | L | XL | Original | |
|
|--------|---|---|---|----|----| |
|
| H100 | 106.06 | 105.82 | 107.07 | 106.55 | 41.09 | |
|
| L40S | 74.97 | 71.52 | 68.09 | 63.86 | 36.40 | |
|
|
|
**Batch Size 32:** |
|
|
|
| GPU Type | S | M | L | XL | Original | |
|
|--------|---|---|---|----|----| |
|
| H100 | 83.58 | 84.13 | 84.04 | 83.90 | 34.50 | |
|
| L40S | 57.36 | 55.60 | 53.73 | 51.33 | 28.72 | |
|
|
|
|
|
## Links |
|
|
|
* __Platform__: [app.thestage.ai](https://app.thestage.ai) |
|
* __Subscribe for updates__: [TheStageAI X (Twitter)](https://x.com/TheStageAI) |
|
* __Contact email__: [email protected] |
|
|