---
license: cc-by-4.0
metrics:
- accuracy
- f1
- uar
pipeline_tag: audio-classification
tags:
- audio
- audio-classification
- acoustic-scene-classification
- autrainer
library_name: autrainer
model-index:
- name: dcase-2020-t1a-cnn14-32k-t
  results:
  - task:
      type: audio-classification
      name: Acoustic Scene Classification
    metrics:
    - type: accuracy
      name: Accuracy
      value: 0.6778975741239892
    - type: f1
      name: F1
      value: 0.6749168062342605
    - type: uar
      name: Unweighted Average Recall
      value: 0.6778357903357903
---

# Acoustic Scene Classification Model

`CNN14` model from the [PANN](https://zenodo.org/records/3987831) family that classifies audio files into one of the following 10 different acoustic scenes:
_airport_, _bus_, _metro_, _metro_station_, _park_, _public_square_, _shopping_mall_, _street_pedestrian_, _street_traffic_, and _tram_.

## Installation

To use the model, install autrainer, for example via pip:

```bash
pip install autrainer
```

## Usage

The model can be applied to all audio files in a folder (`<data-root>`); the predictions are stored in another folder (`<output-root>`):

```bash
autrainer inference hf:autrainer/dcase2020-t1a-cnn14-32k-t <data-root> <output-root>
```

## Training

### Pretraining

The model was originally trained on AudioSet by [Kong et al.](https://zenodo.org/records/3987831).

### Dataset

The model was subsequently fine-tuned on the training set of the [DCASE 2020 Task 1A](http://dcase.community/challenge2020/task-acoustic-scene-classification) dataset.
The dataset comprises 10 different acoustic scenes recorded in 12 European cities with real and simulated devices.
The audio recordings were provided as 10-second segments with a sample rate of 48 kHz.

### Features

The DCASE 2020 Task 1A dataset was resampled to 32 kHz to match the sampling rate of AudioSet, on which the model was pretrained.
Log-Mel spectrograms were then extracted with torchlibrosa, using the same parameters the upstream model was trained with.
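
As an illustration, the following sketch reproduces this front-end, assuming the standard CNN14 parameters from the PANN repository (1024-sample Hann window, 320-sample hop, 64 Mel bins, 50 Hz to 14 kHz); the file name is a placeholder, and autrainer's actual preprocessing may differ in detail:

```python
import torchaudio
from torchlibrosa.stft import LogmelFilterBank, Spectrogram

# Load a clip and resample from 48 kHz to 32 kHz ("scene.wav" is a placeholder).
waveform, sr = torchaudio.load("scene.wav")  # shape: (channels, samples)
waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=32000)

# Assumed CNN14 front-end parameters (PANN repository defaults).
spectrogram = Spectrogram(n_fft=1024, hop_length=320, win_length=1024,
                          window="hann", center=True, pad_mode="reflect",
                          freeze_parameters=True)
logmel = LogmelFilterBank(sr=32000, n_fft=1024, n_mels=64, fmin=50, fmax=14000,
                          ref=1.0, amin=1e-10, top_db=None,
                          freeze_parameters=True)

# A mono clip of shape (1, samples) is treated as a batch of one.
features = logmel(spectrogram(waveform))  # shape: (batch, 1, time, 64)
```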

### Training Process

The model was trained for 50 epochs.
At the end of each epoch, it was evaluated on the validation set.
We release the state that achieved the best performance on this validation set.
All training hyperparameters can be found in the main configuration file (`conf/config.yaml`).
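
Schematically, the checkpoint selection described above corresponds to the following sketch; the model, data, and training step are illustrative stand-ins, not autrainer's actual training loop:

```python
import copy
import torch

model = torch.nn.Linear(64, 10)  # stand-in for the fine-tuned CNN14
val_x, val_y = torch.randn(8, 64), torch.randint(0, 10, (8,))  # dummy validation data

best_acc, best_state = -1.0, None
for epoch in range(50):
    # ... one training epoch on the DCASE 2020 Task 1A training set runs here ...
    with torch.no_grad():
        acc = (model(val_x).argmax(dim=1) == val_y).float().mean().item()
    if acc > best_acc:  # keep the state with the best validation performance
        best_acc, best_state = acc, copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)  # the released state
```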

### Evaluation

No public test set is provided for the DCASE 2020 Task 1A dataset; we therefore evaluate the model on the validation set.
On this validation set, the model achieves a classification accuracy of approximately 0.68.
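
For reference, the reported metrics can be computed with scikit-learn along the following lines; the labels are placeholders, and macro averaging for the F1 score is an assumption:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Placeholder labels; in practice, these come from the validation set.
y_true = ["airport", "bus", "park", "tram"]
y_pred = ["airport", "tram", "park", "tram"]

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="macro")       # assuming macro averaging
uar = recall_score(y_true, y_pred, average="macro")  # unweighted average recall
```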

### Acknowledgements

Please acknowledge the work that produced the original model and the DCASE 2020 Task 1A dataset.
We would also appreciate an acknowledgement of autrainer.