---
title: Audio classification - XGBoost and small deep neural network
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
---
# Audio Classification - XGBoost and Small Deep Neural Network
This is an ensemble of two models (XGBoost and a small CNN) for the audio classification task of the Frugal AI Challenge 2024. Instead of providing binary labels (0 or 1), both models predict a probability between 0 and 1. This allows a trade-off between precision and recall by setting a threshold within this range. Both models are trained independently, and their predictions are averaged.
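For illustration, a minimal sketch of the averaging and thresholding step (variable and function names are hypothetical, not taken from the actual codebase):

```python
import numpy as np

def ensemble_predict(xgb_probs: np.ndarray, cnn_probs: np.ndarray,
                     threshold: float = 0.5) -> np.ndarray:
    """Average the two models' probabilities and apply a decision threshold."""
    avg_probs = (xgb_probs + cnn_probs) / 2.0
    # Moving the threshold within [0, 1] trades precision against recall.
    return (avg_probs >= threshold).astype(int)
```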
## Intended Use
- Primary intended use: Identifying illegal logging in forests.
## Training Data

The model uses the `rfcx/frugalai` dataset:
- Size: ~50,000 examples.
- Split: 70% training, 30% testing.
- Binary labels:
  - `0`: Chainsaw.
  - `1`: Environment.
## Data Preprocessing
Most audio samples are 3 seconds long, with a sampling rate of 12,000 Hz. This means each row of the dataset contains 36,000 elements.
- Resampling: Audio files with a higher sampling rate are downsampled to 12,000 Hz.
- Padding: Audio files shorter than 3 seconds are padded with their reversed signal at the end.
- Storage: Raw audio data is stored in a NumPy array of size `(n, 36000)` with `float16` precision to reduce memory usage without significant precision loss compared to `float32`.
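A rough sketch of these steps, assuming `librosa` is used for resampling (the exact implementation may differ):

```python
import numpy as np
import librosa

TARGET_SR = 12_000
TARGET_LEN = 3 * TARGET_SR  # 3 seconds -> 36,000 samples

def preprocess(audio: np.ndarray, sr: int) -> np.ndarray:
    # Downsample higher-rate recordings to 12,000 Hz.
    if sr != TARGET_SR:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    # Pad clips shorter than 3 seconds with their reversed signal at the end
    # (assumes the clip is at least half the target length).
    if len(audio) < TARGET_LEN:
        audio = np.concatenate([audio, audio[::-1][: TARGET_LEN - len(audio)]])
    # Keep exactly 3 seconds and store as float16 to halve memory usage.
    return audio[:TARGET_LEN].astype(np.float16)
```

Rows produced this way are stacked into the `(n, 36000)` array described above.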
## Model Description: XGBoost
This is an XGBoost regressor that outputs probabilities. It consists of 3000 trees.
### Input Features
XGBoost uses the following input features:
- MFCC:
  - 55 MFCCs are retained.
  - Calculated with a window size of 1024 for `nfft`.
  - Mean and standard deviation are taken along the time axis (resulting in 110 features).
- Mel Spectrogram:
  - Calculated with a window size of 1024 for `nfft` and 55 mel bands.
  - Mean and standard deviation along the time axis (110 features).
  - Standard deviation of the delta coefficients of the spectrogram (55 additional features). This captures the characteristic signature of chainsaw sounds transitioning from idle to full load. (See Exploratory Data Analysis.)
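A hedged sketch of this feature extraction using `librosa` (hop length and any other unstated parameters are illustrative choices):

```python
import numpy as np
import librosa

def xgb_features(audio: np.ndarray, sr: int = 12_000) -> np.ndarray:
    audio = audio.astype(np.float32)
    # 55 MFCCs with a 1024-sample window: mean + std over time -> 110 features.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=55, n_fft=1024)
    # Mel spectrogram with 55 mel bands: mean + std over time -> 110 features.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024, n_mels=55)
    # Std of the spectrogram's delta coefficients -> 55 additional features.
    delta = librosa.feature.delta(mel)
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        mel.mean(axis=1), mel.std(axis=1),
        delta.std(axis=1),
    ])  # 275 features per clip
```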
### Training Details
- Framework: Python library `xgboost`.
- Training on GPU using CUDA.
- Learning rate: `0.02`.
- No data augmentation was used.
Training notebook: XGBoost Training Notebook
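A minimal, hypothetical training sketch under these settings, using the scikit-learn style API of `xgboost` (assuming XGBoost ≥ 2.0 for the `device` parameter; the objective and the placeholder data are illustrative):

```python
import numpy as np
from xgboost import XGBRegressor

# Placeholder data so the sketch runs; real inputs are the 275-dim features above.
X_train = np.random.rand(256, 275).astype(np.float32)
y_train = np.random.randint(0, 2, size=256)

model = XGBRegressor(
    n_estimators=3000,         # 3000 trees
    learning_rate=0.02,
    device="cuda",             # train on GPU via CUDA
    objective="reg:logistic",  # assumption: keeps predictions in [0, 1]
)
model.fit(X_train, y_train)
probs = model.predict(X_train)  # probabilities between 0 and 1
```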
## Model Description: CNN
This is a small convolutional neural network (CNN) with sigmoid activation and approximately 1M parameters.
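For illustration only, a PyTorch sketch of what a small CNN with a sigmoid output can look like; the actual layer sizes, depth, and parameter count of the model are not documented here:

```python
import torch
import torch.nn as nn

class SmallAudioCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, 1)

    def forward(self, x):  # x: (batch, 1, n_mels, time)
        x = self.features(x).flatten(1)
        # Sigmoid activation turns the logit into a probability in [0, 1].
        return torch.sigmoid(self.head(x)).squeeze(1)
```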
### Input Features
The CNN uses Log Mel Spectrograms as input features:
- Calculated with a window size of 1024 for `nfft` and 100 mel bands.
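A sketch of this input-feature computation with `librosa` (the decibel conversion and any unstated parameters are assumptions):

```python
import numpy as np
import librosa

def log_mel_spectrogram(audio: np.ndarray, sr: int = 12_000) -> np.ndarray:
    mel = librosa.feature.melspectrogram(
        y=audio.astype(np.float32), sr=sr, n_fft=1024, n_mels=100
    )
    # Log-scaled (dB) mel spectrogram, fed to the CNN as a single-channel image.
    return librosa.power_to_db(mel)
```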
### Training Details
- Framework: PyTorch.
- Optimizer: Adam.
- Learning rate: `0.001`.
- Loss function: Binary Cross-Entropy Loss.
- Data augmentation: Mixup was used on pairs of raw audio clips, such that the second element is always environment (labeled 1) and the resulting audio takes the label of the first element. This adds more variance to the environment class and more difficult samples to the chainsaw class.
Training notebook: CNN Training Notebook
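A sketch of this mixup scheme on raw waveforms (the mixing weight and the function name are illustrative; the actual implementation is in the training notebook):

```python
import numpy as np

def mixup_with_environment(audio: np.ndarray, label: int,
                           env_audio: np.ndarray, alpha: float = 0.5):
    """Mix any clip with an environment clip (label 1); keep the first clip's label."""
    mixed = (1 - alpha) * audio + alpha * env_audio
    # Chainsaw clips gain realistic background noise (harder positives), while
    # environment clips gain variance (two environments blended together).
    return mixed, label
```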
## Performance
In this challenge, accuracy and energy consumption are measured. The generation of spectrograms, MFCCs, and model inference are included in the energy consumption tracking. However, data loading is not included.
### Metrics
- XGBoost Accuracy: ~95.3%
- CNN Accuracy: ~95.7%
- Ensemble Accuracy: ~96.1%
- Total Energy Consumption (in Wh, on NVIDIA T4): ~0.164
## Environmental Impact
Environmental impact is tracked using CodeCarbon, measuring:
- Carbon emissions during inference
- Energy consumption during inference
This tracking helps establish a baseline for the environmental impact of model deployment and inference.
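A minimal example of this kind of tracking with the `codecarbon` package (the measured block is a placeholder):

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()
tracker.start()
# ... feature extraction and model inference would run here ...
emissions_kg = tracker.stop()  # estimated CO2-equivalent emissions in kg
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```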
## Limitations
- The model can misclassify other motor sounds (e.g., chainsaw vs. motocross).
- This model is optimized to run on specific hardware (an NVIDIA GPU).
## Ethical Considerations
- Environmental impact is tracked to promote awareness of AI's carbon footprint