Acoustic Scene Classification Model

A CNN14 model from the PANN family that classifies audio files into one of 10 acoustic scenes: airport, bus, metro, metro_station, park, public_square, shopping_mall, street_pedestrian, street_traffic, and tram.

Installation

To use the model, you have to install autrainer, e.g. via pip:

pip install autrainer

Usage

The model can be applied to all audio files in a folder (<data-root>); the predictions are stored in another folder (<output-root>):

autrainer inference hf:autrainer/dcase-2020-t1a-cnn14-32k-t <data-root> <output-root>

Training

Pretraining

The model was originally trained on AudioSet by Kong et al.

Dataset

The model was then fine-tuned on the training set of the DCASE 2020 Task 1A dataset. The dataset comprises 10 different acoustic scenes recorded in 12 European cities with real and simulated devices. The audio recordings are provided as 10-second segments with a sample rate of 48 kHz.

Features

The DCASE 2020 Task 1A dataset was resampled to 32 kHz, as this was the sampling rate of AudioSet, which the model was pretrained on. Then, log-Mel spectrograms were extracted with torchlibrosa using the parameters that the upstream model was trained on.
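The effect of this preprocessing on input shapes can be worked out with a little arithmetic. The sketch below uses the sample rates and clip length stated in this card. The hop size and number of Mel bins are assumptions (the values commonly published with the 32 kHz PANN CNN14 checkpoint) and should be checked against conf/config.yaml, as should the centered STFT framing. The helper names are hypothetical and not part of autrainer or torchlibrosa.

```python
from fractions import Fraction

SOURCE_RATE = 48_000  # DCASE 2020 Task 1A recordings (stated in this card)
TARGET_RATE = 32_000  # AudioSet rate used for pretraining (stated in this card)
CLIP_SECONDS = 10     # segment length (stated in this card)

# Assumed values, NOT stated in this card; verify against conf/config.yaml.
HOP_SIZE = 320
MEL_BINS = 64


def resampled_length(num_samples: int, source_rate: int, target_rate: int) -> int:
    """Number of samples after resampling, using an exact rational ratio."""
    ratio = Fraction(target_rate, source_rate)  # 32000/48000 reduces to 2/3
    return int(num_samples * ratio)


def centered_frame_count(num_samples: int, hop_size: int) -> int:
    """Frame count of an STFT with centered (padded) framing."""
    return 1 + num_samples // hop_size


samples_48k = SOURCE_RATE * CLIP_SECONDS
samples_32k = resampled_length(samples_48k, SOURCE_RATE, TARGET_RATE)
frames = centered_frame_count(samples_32k, HOP_SIZE)

# Under these assumptions, one clip becomes a (frames, MEL_BINS) spectrogram.
print(samples_48k, samples_32k, (frames, MEL_BINS))
```

Under these assumptions, a 10-second clip shrinks from 480,000 to 320,000 samples, and the resulting log-Mel spectrogram has 1001 frames of 64 Mel bins each.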

Training Process

The model was trained for 50 epochs. At the end of each epoch, the model was evaluated on the validation set. We release the state that achieved the best performance on this validation set. All training hyperparameters can be found in the main configuration file (conf/config.yaml).
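The selection of the released state can be illustrated with a minimal sketch. It does not reproduce autrainer's internals: train_one_epoch and validate are hypothetical callables supplied by the caller, and the validation metric is assumed to be one where higher is better (such as accuracy).

```python
from typing import Callable, Tuple


def select_best_epoch(
    num_epochs: int,
    train_one_epoch: Callable[[int], None],  # hypothetical training step
    validate: Callable[[int], float],        # hypothetical validation metric
) -> Tuple[int, float]:
    """Train for num_epochs and return the epoch with the best validation score."""
    best_epoch, best_score = 0, float("-inf")
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(epoch)
        score = validate(epoch)
        # Keep the earliest epoch on ties; a real run would also save the weights here.
        if score > best_score:
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Usage with made-up validation scores for three epochs (illustration only).
dummy_scores = {1: 0.50, 2: 0.60, 3: 0.55}
best = select_best_epoch(3, lambda epoch: None, lambda epoch: dummy_scores[epoch])
print(best)
```

In this toy run the second epoch would be the one released, even though training continued for another epoch.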

Evaluation

No public test set is provided for the DCASE 2020 Task 1A dataset. Therefore, we evaluate the model on the validation set. The model achieves a classification accuracy of 0.67 on the validation set.
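For reference, classification accuracy is the fraction of clips whose predicted scene matches the annotated scene. The sketch below is a plain-Python illustration with made-up labels; it is not the evaluation code used to obtain the reported figure, and the function name is hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that equal the reference label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for four clips (illustration only, not real results).
reference = ["airport", "bus", "tram", "park"]
predicted = ["airport", "metro", "tram", "park"]
print(accuracy(reference, predicted))
```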

Acknowledgements

Please acknowledge the work which produced the original model and the DCASE 2020 Task 1A dataset. We would also appreciate an acknowledgment to autrainer.
