---
license: cc-by-4.0
metrics:
  - accuracy
  - f1
  - uar
pipeline_tag: audio-classification
tags:
  - audio
  - audio-classification
  - acoustic-scene-classification
  - autrainer
library_name: autrainer
model-index:
  - name: dcase-2020-t1a-cnn14-32k-t
    results:
      - task:
          type: audio-classification
          name: Acoustic Scene Classification
        metrics:
          - type: accuracy
            name: Accuracy
            value: 0.6778975741239892
          - type: f1
            name: F1
            value: 0.6749168062342605
          - type: uar
            name: Unweighted Average Recall
            value: 0.6778357903357903
---

# Acoustic Scene Classification Model

`CNN14` model from the [PANN](https://zenodo.org/records/3987831) family that classifies audio files into one of the following 10 different acoustic scenes:
_airport_, _bus_, _metro_, _metro_station_, _park_, _public_square_, _shopping_mall_, _street_pedestrian_, _street_traffic_, and _tram_.

## Installation

To use the model, install autrainer, e.g. via pip:

```bash
pip install autrainer
```

## Usage

The model can be applied to all audio files in a folder (`<data-root>`), and the predictions are stored in another folder (`<output-root>`):

```bash
autrainer inference hf:autrainer/dcase2020-t1a-cnn14-32k-t <data-root> <output-root>
```

## Training

### Pretraining

The model was originally trained on AudioSet by [Kong et al.](https://zenodo.org/records/3987831).

### Dataset

The model was further trained (fine-tuned) on the training set of the [DCASE 2020 Task 1A](http://dcase.community/challenge2020/task-acoustic-scene-classification) dataset.
The dataset comprises 10 different acoustic scenes recorded in 12 European cities with real and simulated devices.
The audio recordings were provided as 10-second segments with a sample rate of 48 kHz.

### Features

The DCASE 2020 Task 1A dataset was resampled to 32 kHz, as this was the sampling rate of AudioSet, which the model was pretrained on.
Then, log-Mel spectrograms were extracted with torchlibrosa using the parameters that the upstream model was trained on.
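
As a rough illustration of what this preprocessing implies for the input size, the sketch below computes the resampling ratio and the number of spectrogram frames for one 10-second segment. The hop size (320 samples), the number of Mel bins (64), and the use of a centered STFT are the values commonly published for the 32 kHz `CNN14` checkpoint; they are assumptions here and are not read from this model's configuration.

```python
from fractions import Fraction

SOURCE_RATE = 48_000    # sample rate of the provided recordings (Hz)
TARGET_RATE = 32_000    # sample rate used for AudioSet pretraining (Hz)
SEGMENT_SECONDS = 10    # length of one recording

# Assumed STFT settings (commonly published PANN defaults, not confirmed here).
HOP_SIZE = 320
MEL_BINS = 64


def resample_ratio(source_rate: int, target_rate: int) -> Fraction:
    """Return the reduced ratio of output samples to input samples."""
    return Fraction(target_rate, source_rate)


def num_frames(num_samples: int, hop_size: int) -> int:
    """Frame count of a centered STFT, where the signal is padded by half a
    window on both sides, so the count depends only on the hop size."""
    return 1 + num_samples // hop_size


ratio = resample_ratio(SOURCE_RATE, TARGET_RATE)
samples = SEGMENT_SECONDS * TARGET_RATE
frames = num_frames(samples, HOP_SIZE)

print(f"resample ratio: {ratio}")
print(f"samples per segment after resampling: {samples}")
print(f"log-Mel shape (frames, mel bins): ({frames}, {MEL_BINS})")
```

Under these assumptions, every two output samples correspond to three input samples, and each segment yields a log-Mel spectrogram of 1001 frames by 64 Mel bins.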

### Training Process

The model was trained for 50 epochs.
At the end of each epoch, the model was evaluated on the validation set.
We release the state that achieved the best performance on this validation set.
All training hyperparameters can be found in the main configuration file (`conf/config.yaml`).
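
The checkpoint selection described above amounts to keeping the epoch with the highest validation score. The following is a minimal sketch with made-up validation scores and a hypothetical `select_best_epoch` helper; autrainer handles this internally, and this is not its implementation.

```python
from typing import Sequence, Tuple


def select_best_epoch(val_scores: Sequence[float]) -> Tuple[int, float]:
    """Return the 1-indexed epoch with the highest validation score.

    Ties are resolved in favor of the earlier epoch.
    """
    if not val_scores:
        raise ValueError("val_scores must contain at least one epoch")
    best_epoch, best_score = 1, val_scores[0]
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:  # strict comparison keeps the earlier epoch on ties
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Illustrative values only; these are not the scores from the actual run.
scores = [0.52, 0.61, 0.66, 0.64, 0.66]
best_epoch, best_score = select_best_epoch(scores)
print(f"best epoch: {best_epoch}, validation score: {best_score}")
```

The strict comparison is a design choice for this sketch: when two epochs score equally, the earlier state is kept.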

### Evaluation

No public test set is provided for the DCASE 2020 Task 1A dataset.
Therefore, we evaluate the model on the validation set.
The model achieves an accuracy of 0.678, an F1 score of 0.675, and an unweighted average recall (UAR) of 0.678 on the validation set.
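
For readers unfamiliar with the metrics listed for this model, the sketch below shows how accuracy, UAR (the mean of the per-class recalls), and F1 can be computed from true and predicted labels. It assumes a macro-averaged F1, which this card does not state explicitly, and uses a small made-up label set; `classification_metrics` is a hypothetical helper, not part of autrainer.

```python
from collections import Counter
from typing import Sequence, Tuple


def classification_metrics(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Tuple[float, float, float]:
    """Return (accuracy, UAR, macro F1) for single-label predictions."""
    labels = sorted(set(y_true))
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    support = Counter(y_true)    # number of true samples per class
    predicted = Counter(y_pred)  # number of predictions per class

    accuracy = sum(correct.values()) / len(y_true)
    recalls, f1_scores = [], []
    for label in labels:
        recall = correct[label] / support[label]
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)

    uar = sum(recalls) / len(labels)          # every class weighs the same
    macro_f1 = sum(f1_scores) / len(labels)   # assumed averaging mode
    return accuracy, uar, macro_f1


# Made-up labels for illustration only.
y_true = ["bus", "bus", "bus", "park", "tram", "tram"]
y_pred = ["bus", "bus", "park", "park", "tram", "bus"]
accuracy, uar, macro_f1 = classification_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f} uar={uar:.3f} f1={macro_f1:.3f}")
```

Unlike accuracy, UAR gives every scene the same weight regardless of how many recordings it has, which is why the two values can differ when classes are imbalanced.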

### Acknowledgements

Please acknowledge the work that produced the original model and the DCASE 2020 Task 1A dataset.
We would also appreciate an acknowledgement of autrainer.