Tags: Audio-Text-to-Text · PEFT · Safetensors · English · mistral-lmm
dorienh committed (verified) · Commit 4f94bec · 1 parent: 0cbda72

Update README.md
Files changed (1): README.md (+5 −4)
README.md CHANGED
@@ -18,21 +18,22 @@ pipeline_tag: audio-text-to-text
 
 SonicVerse is a multi-task music captioning model that integrates caption generation with auxiliary music feature detection tasks such as key detection, vocals detection, and more. The model directly captures both low-level acoustic details as well as high-level musical attributes through a novel projection-based architecture that transforms audio input into natural language captions while simultaneously detecting music features through dedicated auxiliary heads. Additionally, SonicVerse enables the generation of temporally informed long captions for extended music pieces by chaining outputs from short segments using large language models, providing detailed time-informed descriptions that capture the evolving musical narrative.
 
-View **demo** on [our HuggingFace Space](https://huggingface.co/spaces/amaai-lab/SonicVerse).
+View **demo** on [our HuggingFace Space](https://huggingface.co/spaces/amaai-lab/SonicVerse)
+
 **Read the paper:** [SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning](https://arxiv.org/abs/2506.15154)
-**GitHub:** [https://github.com/AMAAI-Lab/SonicVerse](https://github.com/AMAAI-Lab/SonicVerse)
+
+**GitHub:** [https://github.com/AMAAI-Lab/SonicVerse](https://github.com/AMAAI-Lab/SonicVerse)
 
 
 ## How to Get Started with the Model
 
-Use the instructions provided on the[GitHub repository](https://github.com/AMAAI-Lab/SonicVerse) to run inference locally. Alternatively try out the model on the [Spaces demo](https://huggingface.co/spaces/amaai-lab/SonicVerse).
+Use the instructions provided on the [GitHub repository](https://github.com/AMAAI-Lab/SonicVerse) to run inference locally. Alternatively try out the model on the [Spaces demo](https://huggingface.co/spaces/amaai-lab/SonicVerse).
 
 
 ## Citation
 
 If you use SonicVerse, please cite our [paper](https://doi.org/10.48550/arXiv.2506.15154):
 
-**BibTeX:**
 
 ```bibtex
 @article{chopra2025sonicverse,