---
license: bsd-3-clause
pipeline_tag: audio-classification
library_name: transformers
tags:
  - PyTorch
  - State-space
  - Mamba
---

# DASS: Distilled Audio State-space Models

DASS (Distilled Audio State-space Models) is an audio classification model finetuned on AudioSet-2M.
DASS is the first state-space model to outperform transformer-based audio classifiers such as AST (Audio Spectrogram Transformer), HTS-AT, and Audio-MAE.
DASS achieves state-of-the-art performance on AudioSet classification while significantly reducing model size: compared to AST, which has approximately 87M parameters, DASS-small has about one third as many (30M) and still outperforms AST (AudioSet-2M mAP: 45.9 for AST vs. 47.2 for DASS-small).
It is available in two variants: DASS-small (30M parameters, 47.2 mAP) and DASS-medium (49M parameters, 47.6 mAP).

DASS is also significantly more duration-robust than AST: it can be trained on short audio and evaluated on much longer audio without fine-tuning.
For example, with both models trained on 10-second clips, AST drops below 5 mAP when the input is 50 seconds long (less than 12% of its 10-second performance), whereas DASS reaches 45.5 mAP (96%) in the same setting.
On a single A6000 GPU, DASS can take up to 2.5 hours of audio as input and still retain 62% of its 10-second-input performance.

DASS was introduced in the paper [DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners](https://arxiv.org/pdf/2407.04082) and 
first released in [this repository](https://github.com/Saurabhbhati/DASS).

## Model Details

The DASS model is based on [VMamba: Visual State Space Model](https://arxiv.org/pdf/2401.10166), adapted to audio. 
It is trained with a binary cross-entropy loss w.r.t. the ground-truth labels and a KL-divergence loss w.r.t. a teacher AST model.
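
For intuition, a minimal sketch of such a combined objective is shown below; the loss weighting `alpha` and the `temperature` are illustrative assumptions, not the exact recipe from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    """Illustrative DASS-style objective: BCE on labels plus KL to the AST teacher.

    alpha and temperature are placeholder hyperparameters, not the paper's values.
    """
    # Binary cross-entropy against the multi-hot ground-truth labels
    bce = F.binary_cross_entropy_with_logits(student_logits, labels)
    # KL divergence between the student's and the (frozen) AST teacher's class distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    return alpha * bce + (1.0 - alpha) * kl
```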

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import librosa
from transformers import AutoConfig, AutoModelForAudioClassification, AutoFeatureExtractor

config = AutoConfig.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)
audio_model = AutoModelForAudioClassification.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)

# Load a 16 kHz waveform and convert it to model inputs
waveform, sr = librosa.load("audio/eval/_/_/--4gqARaEJE_0.000.flac", sr=16000)
inputs = feature_extractor(waveform, sr, return_tensors='pt')

# Multi-label classification: apply a sigmoid and threshold the class probabilities at 0.5
with torch.no_grad():
    logits = torch.sigmoid(audio_model(**inputs).logits)

predicted_class_ids = torch.where(logits[0] > 0.5)[0]
predicted_label = [audio_model.config.id2label[i.item()] for i in predicted_class_ids]
print(predicted_label)
# ['Animal', 'Domestic animals, pets', 'Dog']
```
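
Because DASS is duration-robust, the same pipeline can be applied to recordings much longer than the 10-second training clips without any changes. A minimal sketch reusing the objects above (the file path is a placeholder):

```python
# Hypothetical long recording (placeholder path); no fine-tuning on long audio is required.
long_waveform, sr = librosa.load("audio/eval/long_recording.flac", sr=16000)
inputs = feature_extractor(long_waveform, sr, return_tensors='pt')

with torch.no_grad():
    logits = torch.sigmoid(audio_model(**inputs).logits)

print([audio_model.config.id2label[i.item()] for i in torch.where(logits[0] > 0.5)[0]])
```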

### Results

Below are the results for DASS models finetuned and evaluated on AudioSet-2M. 

|                                                 | Params | Pretrain |  mAP |
|-------------------------------------------------|:------:|:--------:|:----:|
| Transformer-based models                        |        |          |      |
| [AST](https://arxiv.org/pdf/2104.01778)         |   87M  |   IN SL  | 45.9 |
| [HTS-AT](https://arxiv.org/pdf/2202.00874)      |   31M  |   IN SL  | 47.1 |
| [PaSST](https://arxiv.org/pdf/2110.05069)       |        |   IN SL  | 47.1 |
| [Audio-MAE](https://arxiv.org/pdf/2207.06405)   |   86M  |    SSL   | 47.3 |
| Concurrent SSM models                           |        |          |      |
| [AuM](https://arxiv.org/pdf/2406.03344)         |   26M  |   IN SL  | 39.7 |
| [Audio Mamba](https://arxiv.org/pdf/2405.13636) |   40M  |   IN SL  | 44.0 |
| DASS-Small                                      |   30M  |   IN SL  | 47.2 |
| DASS-Medium                                     |   49M  |   IN SL  | 47.6 |
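
For reference, mAP here is the mean of per-class average precision over the AudioSet classes. A minimal sketch of how one might compute it from sigmoid scores and multi-hot targets (using scikit-learn; this is not the official evaluation script):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(scores: np.ndarray, targets: np.ndarray) -> float:
    """scores, targets: (num_clips, num_classes) sigmoid outputs and multi-hot labels."""
    per_class_ap = [
        average_precision_score(targets[:, c], scores[:, c])
        for c in range(targets.shape[1])
        if targets[:, c].any()  # skip classes with no positives in the eval split
    ]
    return float(np.mean(per_class_ap))
```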


## Citation

```bibtex
@article{bhati2024dass,
  title={DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners},
  author={Bhati, Saurabhchand and Gong, Yuan and Karlinsky, Leonid and Kuehne, Hilde and Feris, Rogerio and Glass, James},
  journal={arXiv preprint arXiv:2407.04082},
  year={2024}
}
```

## Acknowledgements 

This project builds on AST ([paper](https://arxiv.org/pdf/2104.01778), [code](https://github.com/YuanGongND/ast/tree/master)) and 
VMamba ([paper](https://arxiv.org/pdf/2401.10166), [code](https://github.com/MzeroMiko/VMamba/tree/main)); many thanks to the authors for their excellent work. 
Please make sure to check them out.