---
title: Audio classification - XGBoost and small deep neural network
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
---

Audio classification - XGBoost and small deep neural network

This is an ensemble of two models (XGBoost and a small CNN) for the Audio task of the Frugal AI Challenge 2024. Instead of giving binary labels (0 or 1), both models predict a probability between 0 and 1. This allows a trade-off between precision and recall by setting a threshold between 0 and 1. Both models are trained independently, and their predictions are averaged.
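
As a rough illustration, the ensemble step amounts to averaging the two probability vectors and thresholding the result. The sketch below is illustrative only (function and variable names are not from the original code):

```python
import numpy as np

def ensemble_predict(p_xgb: np.ndarray, p_cnn: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Average the two models' probabilities and apply a decision threshold."""
    p_avg = (p_xgb + p_cnn) / 2.0            # simple average of the two models
    return (p_avg >= threshold).astype(int)  # 0: chainsaw, 1: environment
```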

Intended Use

  • Primary intended use: identify illegal logging in forests by detecting chainsaw sounds in audio recordings

Training Data

The model uses the rfcx/frugalai dataset:

  • Size: ~50000 examples
  • Split: 70% train, 30% test
  • Binary labels: 0 = chainsaw, 1 = environment
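
A minimal loading sketch using the Hugging Face `datasets` library (this is not the original training code, and the field names are assumptions based on the description above):

```python
from datasets import load_dataset

# Load the rfcx/frugalai dataset; it is assumed to expose train and test splits
ds = load_dataset("rfcx/frugalai")
print(ds)

sample = ds["train"][0]
print(sample.keys())  # expected to contain an audio field and a binary label
```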

Data Loading

Most of the audio clips are 3 seconds long with a sampling rate of 12,000 Hz, which means that each row of the dataset contains 36,000 samples. A few files have a higher sampling rate, so we resample them to 12,000 Hz. Clips shorter than 3 seconds are padded at the end with a reversed copy of the signal. Raw audio data are stored in a NumPy array of shape (n, 36000), in float16 precision to reduce memory usage (there is no drop in accuracy compared to float32).
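
A sketch of this preprocessing step, assuming `librosa` for resampling (function names and structure are illustrative, not the original code):

```python
import numpy as np
import librosa

TARGET_SR = 12000
TARGET_LEN = 3 * TARGET_SR  # 36,000 samples per clip

def prepare_clip(audio: np.ndarray, sr: int) -> np.ndarray:
    # Resample clips whose sampling rate differs from 12 kHz
    if sr != TARGET_SR:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    # Reverse padding for clips shorter than 3 seconds
    # (assumes the missing part is shorter than the clip itself)
    if len(audio) < TARGET_LEN:
        pad = audio[::-1][: TARGET_LEN - len(audio)]
        audio = np.concatenate([audio, pad])
    # Truncate to 3 seconds and cast to float16 to halve memory usage
    return audio[:TARGET_LEN].astype(np.float16)

# Stacking n prepared clips gives an array of shape (n, 36000) in float16
```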

Model Description : XGBoost

This is an XGBoost regressor with 3,000 trees, which outputs a probability between 0 and 1.

Input features

XGBoost uses the following input features (a rough feature-extraction sketch follows this list):

  • MFCC: 55 MFCCs are retained, computed with a window size of 1024 for n_fft. I took the mean and standard deviation along the time axis (110 features).
  • Mel spectrogram: the mel spectrogram is computed with a window size of 1024 for n_fft and 55 mel bands. I took the mean and standard deviation along the time axis (110 features). In addition, I added the standard deviation of the delta coefficients of the spectrogram, to capture the characteristic signature of a chainsaw sound when it goes from idle to full load (55 more features).
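
The feature extraction could look roughly like the sketch below, using `librosa`; any parameter not mentioned above (hop length, defaults, etc.) is an assumption:

```python
import numpy as np
import librosa

def xgb_features(audio: np.ndarray, sr: int = 12000) -> np.ndarray:
    audio = audio.astype(np.float32)

    # 55 MFCCs, mean and std over time -> 110 features
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=55, n_fft=1024)
    mfcc_feats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # 55-band mel spectrogram, mean and std over time -> 110 features
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024, n_mels=55)
    mel_feats = np.concatenate([mel.mean(axis=1), mel.std(axis=1)])

    # Std of the delta (frame-to-frame change) of the mel spectrogram -> 55 features,
    # meant to capture the chainsaw going from idle to full load
    delta = librosa.feature.delta(mel)
    delta_feats = delta.std(axis=1)

    return np.concatenate([mfcc_feats, mel_feats, delta_feats])  # 275 features
```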

Training Details

I used the Python library xgboost. I trained the model on CUDA (GPU), with a learning rate of 0.02. No data augmentation was used.
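
A minimal sketch of this setup under the stated hyperparameters (3,000 trees, learning rate 0.02, CUDA training); the objective and everything else are defaults or assumptions:

```python
import numpy as np
import xgboost as xgb

# Dummy data with the feature dimensionality described above (275 features per clip)
X_train = np.random.rand(100, 275)
y_train = np.random.randint(0, 2, 100)

model = xgb.XGBRegressor(
    n_estimators=3000,          # 3,000 trees
    learning_rate=0.02,
    device="cuda",              # GPU training (requires xgboost >= 2.0)
    objective="reg:logistic",   # one way to keep predictions in [0, 1] (an assumption)
)
model.fit(X_train, y_train)
probs = model.predict(X_train)  # probabilities between 0 and 1
```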

Model Description : CNN

This is a CNN with a sigmoid output activation and almost 1M parameters.

Input features

This model uses log mel spectrograms as input features, computed with a window size of 1024 for n_fft and 100 mel bands.
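
The CNN input could be computed along these lines (a sketch assuming `librosa`; only the 1024-sample window and the 100 mel bands come from the description above):

```python
import numpy as np
import librosa

def log_mel(audio: np.ndarray, sr: int = 12000) -> np.ndarray:
    """Log-scaled mel spectrogram used as CNN input."""
    mel = librosa.feature.melspectrogram(
        y=audio.astype(np.float32), sr=sr, n_fft=1024, n_mels=100
    )
    return librosa.power_to_db(mel, ref=np.max)  # convert power to decibels (log scale)
```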

Training Details

PyTorch was used to train this model, with an Adam optimizer (learning rate 0.001) and a binary cross-entropy loss. As data augmentation, I randomly added sounds labeled as environment to the training clips to add more noise to the dataset, without changing the labels (see the sketch below).
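
A rough sketch of this training setup. `SmallCNN` is an illustrative stand-in (the real ~1M-parameter architecture is not given here), and `train_loader` is a placeholder assumed to yield log-mel spectrograms with the environment-noise augmentation already applied to the waveforms:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Illustrative stand-in for the real ~1M-parameter CNN."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, x):                      # x: (batch, 1, n_mels, frames)
        return self.head(self.conv(x))

model = SmallCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate from above
criterion = nn.BCELoss()                                    # binary cross-entropy loss

for specs, labels in train_loader:             # train_loader is a placeholder
    optimizer.zero_grad()
    preds = model(specs).squeeze(1)            # sigmoid output in [0, 1]
    loss = criterion(preds, labels.float())
    loss.backward()
    optimizer.step()
```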

Performance

In this challenge we measure the accuracy and the energy consumption of the model. The generation of the spectrograms and MFCCs (and model inference, of course) is included in the energy consumption tracking. However, I did not include the data loading part.

Metrics

  • Accuracy (0.5 threshold): ~96.1%
  • Energy consumption in Wh: ~0.164

Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:

  • Carbon emissions during inference
  • Energy consumption during inference

This tracking helps establish a baseline for the environmental impact of model deployment and inference.
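
A minimal sketch of how such a measurement can be wrapped with CodeCarbon (the tracked region stands for the feature extraction and model inference described above):

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()
tracker.start()
# ... spectrogram/MFCC computation and model inference go here ...
emissions = tracker.stop()  # estimated CO2-equivalent emissions (kg)
print(emissions)
```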

Limitations

  • The model can misclassify other motor sounds as a chainsaw

Ethical Considerations

  • Environmental impact is tracked to promote awareness of AI's carbon footprint