---
title: Audio classification - XGBoost and small deep neural network
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
---

# Audio Classification - XGBoost and Small Deep Neural Network

This is an ensemble of two models (an XGBoost regressor and a small CNN) for the audio classification task of the Frugal AI Challenge 2024. Instead of emitting binary labels (0 or 1), both models predict a probability between 0 and 1, which allows trading precision against recall by choosing a decision threshold in that range. The two models are trained independently, and their predictions are averaged.
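The averaging-and-thresholding step can be sketched as follows (function and variable names are illustrative, not from the original code):

```python
import numpy as np

def ensemble_predict(p_xgb: np.ndarray, p_cnn: np.ndarray,
                     threshold: float = 0.5) -> np.ndarray:
    """Average the per-sample probabilities of the two models and
    apply a decision threshold.

    p_xgb, p_cnn: probabilities in [0, 1] of the positive class (1 = environment).
    Returns binary labels (0 = chainsaw, 1 = environment).
    """
    p_mean = (p_xgb + p_cnn) / 2.0            # simple unweighted average
    return (p_mean >= threshold).astype(int)  # move threshold to trade precision vs recall
```

Lowering the threshold flags more clips as chainsaw (higher recall on the target class), raising it does the opposite.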


## Intended Use

- Primary intended use: identifying illegal logging in forests.

## Training Data

The model uses the rfcx/frugalai dataset:

- Size: ~50,000 examples.
- Split: 70% training, 30% testing.
- Binary labels:
  - 0: Chainsaw.
  - 1: Environment.

## Data Preprocessing

Most audio samples are 3 seconds long, with a sampling rate of 12,000 Hz, so each row of the dataset contains 36,000 samples.

- Resampling: audio files with a higher sampling rate are downsampled to 12,000 Hz.
- Padding: audio files shorter than 3 seconds are padded at the end with their reversed signal.
- Storage: raw audio data is stored in a NumPy array of shape (n, 36000) with float16 precision, reducing memory usage without significant precision loss compared to float32.
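A minimal sketch of the padding and storage steps (resampling omitted; names are illustrative):

```python
import numpy as np

TARGET_LEN = 36_000  # 3 s at 12,000 Hz

def pad_with_reversed(audio: np.ndarray) -> np.ndarray:
    """Extend a short clip by appending its reversed signal until it
    reaches TARGET_LEN, then truncate to exactly TARGET_LEN."""
    out = audio
    while out.shape[0] < TARGET_LEN:
        out = np.concatenate([out, audio[::-1]])
    return out[:TARGET_LEN]

def build_dataset(clips: list) -> np.ndarray:
    """Stack clips into an (n, 36000) float16 array, halving memory vs float32."""
    return np.stack([pad_with_reversed(c) for c in clips]).astype(np.float16)
```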

## Model Description: XGBoost

This is an XGBoost regressor that outputs probabilities. It consists of 3,000 trees.

### Input Features

XGBoost uses the following input features:

- MFCC:
  - 55 MFCCs are retained.
  - Calculated with a window size of 1024 for `n_fft`.
  - Mean and standard deviation are taken along the time axis (110 features).
- Mel spectrogram:
  - Calculated with a window size of 1024 for `n_fft` and 55 mel bands.
  - Mean and standard deviation along the time axis (110 features).
  - Standard deviation of the delta coefficients of the spectrogram (55 additional features). This captures the characteristic signature of chainsaw sounds transitioning from idle to full load (see Exploratory Data Analysis).
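Given precomputed MFCC and mel-spectrogram matrices of shape (55, T), the 275-dimensional feature vector described above can be assembled as sketched below. The delta coefficients are approximated here with a simple first difference along time; the actual notebook may use a different delta estimator:

```python
import numpy as np

def xgb_features(mfcc: np.ndarray, mel: np.ndarray) -> np.ndarray:
    """Build the 275-d feature vector: MFCC mean/std (110 features),
    mel mean/std (110 features), std of mel deltas (55 features).
    Both inputs have shape (55, T)."""
    delta = np.diff(mel, axis=1)  # first-difference stand-in for delta coefficients
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),  # 110 MFCC statistics
        mel.mean(axis=1), mel.std(axis=1),    # 110 mel statistics
        delta.std(axis=1),                    # 55 delta statistics
    ])
```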

### Training Details

- Framework: Python library `xgboost`.
- Trained on GPU using CUDA.
- Learning rate: 0.02.
- No data augmentation was used.
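A plausible parameter dictionary for this setup. Only the tree count (3,000) and learning rate (0.02) are stated in this card; the objective and remaining settings are illustrative assumptions:

```python
# Hypothetical xgboost configuration; only n_estimators and learning_rate
# come from this card, the rest are illustrative assumptions.
xgb_params = {
    "objective": "reg:logistic",  # regression objective bounded to (0, 1)
    "n_estimators": 3000,         # 3,000 trees
    "learning_rate": 0.02,
    "tree_method": "hist",
    "device": "cuda",             # train on GPU via CUDA (xgboost >= 2.0)
}
```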

Training notebook: XGBoost Training Notebook


## Model Description: CNN

This is a small convolutional neural network (CNN) with a sigmoid output activation and approximately 1M parameters.

### Input Features

The CNN uses log mel spectrograms as input features:

- Calculated with a window size of 1024 for `n_fft` and 100 mel bands.

### Training Details

- Framework: PyTorch.
- Optimizer: Adam.
- Learning rate: 0.001.
- Loss function: binary cross-entropy loss.
- Data augmentation: mixup was used on pairs of raw audio clips, such that the second element of each pair is always environment (labeled 1) and the mixed audio takes the label of the first element. This adds variance to the environment class and harder samples to the chainsaw class.
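The mixup variant described above can be sketched like this; the mixing weight `lam` is an assumption, since the card does not state how the two signals are weighted:

```python
import numpy as np

def mixup_pair(first: np.ndarray, environment: np.ndarray,
               label_first: int, lam: float = 0.7) -> tuple:
    """Mix any raw audio clip with an environment clip (label 1).
    The mixed clip keeps the label of the first element, so the
    environment class gains variance and the chainsaw class gains
    harder (partially masked) samples."""
    mixed = lam * first + (1.0 - lam) * environment
    return mixed, label_first
```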

Training notebook: CNN Training Notebook


## Performance

In this challenge, accuracy and energy consumption are measured. The generation of spectrograms, MFCCs, and model inference are included in the energy consumption tracking. However, data loading is not included.

### Metrics

- XGBoost accuracy: ~95.3%
- CNN accuracy: ~95.7%
- Ensemble accuracy: ~96.1%
- Total energy consumption (Wh, on an NVIDIA T4): ~0.164

### Precision-Recall Curve


## Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:

- Carbon emissions during inference
- Energy consumption during inference

This tracking helps establish a baseline for the environmental impact of model deployment and inference.

## Limitations

- The model can misclassify other motor sounds (e.g. chainsaw vs. motocross).
- The model is optimized to run on specific hardware (an NVIDIA GPU).

## Ethical Considerations

- Environmental impact is tracked to promote awareness of AI's carbon footprint.