---
title: Audio classification - XGBoost and small deep neural network
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: false
---

# Audio classification - XGBoost and small deep neural network

This is an ensemble of 2 models (XGBoost and a small CNN) for the Audio task of the Frugal AI Challenge 2024.

Instead of producing binary labels (0 or 1), both models predict a probability between 0 and 1. This allows a trade-off between precision and recall by setting a threshold between 0 and 1.

Both models are trained independently, and their predictions are averaged.

### Intended Use

- **Primary intended uses**: Identify illegal logging in forests

## Training Data

The model uses the rfcx/frugalai dataset:

- Size: ~50,000 examples
- Split: 70% train, 30% test
- Binary labels | 0: Chainsaw; 1: Environment

### Data Loading

Most of the audio clips are 3 seconds long with a sampling rate of 12,000 Hz, which means that each row of the dataset contains 36,000 samples. A few files have a higher sampling rate, so we resample them to 12,000 Hz. For audio clips shorter than 3 seconds, we add reverse padding at the end.

Raw audio data are stored in a numpy array of shape (n, 36000), with float16 precision to reduce memory usage (there is no drop in accuracy compared to float32).

## Model Description: XGBoost

This is an XGBoost regressor composed of 3000 trees, whose output is used as a probability.

### Input features

XGBoost uses the following input features:

- **MFCC**: 55 MFCCs are retained, computed with a window size of 1024 for n_fft. I took the mean and standard deviation along the time axis (110 features).
- **Mel Spectrogram**: The mel spectrogram is computed with a window size of 1024 for n_fft and 55 mel bands. I took the mean and standard deviation along the time axis (110 features). In addition, I added the standard deviation of the delta coefficients of the spectrogram, in order to capture the characteristic signature of the chainsaw sound when it goes from idle to full load (55 more features).

### Training Details

I used the Python library xgboost. I trained the model on CUDA, with a learning rate of 0.02. No data augmentation was used.

## Model Description: CNN

This is a CNN with a sigmoid activation and almost 1M parameters.

### Input features

This model uses as input features a **log mel spectrogram**, computed with a window size of 1024 for n_fft and 100 mel bands.

### Training Details

PyTorch was used to train this model, with an Adam optimizer (learning rate 0.001) and a binary cross-entropy loss. I randomly added sounds labeled as Environment, to add more noise to the dataset, without changing the labels.

## Performance

In this challenge we measure the accuracy and energy consumption of the model. The generation of the spectrograms and MFCCs (and model inference, of course) is included in the energy consumption tracking. However, I didn't include the data loading part.

### Metrics

- **Accuracy (0.5 threshold)**: ~96.1%
- **Energy consumption in Wh**: ~0.164

## Environmental Impact

Environmental impact is tracked using CodeCarbon, measuring:

- Carbon emissions during inference
- Energy consumption during inference

This tracking helps establish a baseline for the environmental impact of model deployment and inference.

## Limitations

- The model can classify other motor sounds as a Chainsaw

## Ethical Considerations

- Environmental impact is tracked to promote awareness of AI's carbon footprint
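
## Code Sketches

The snippet below is a minimal sketch of the data-loading step described above: resampling to 12,000 Hz, reverse-padding clips shorter than 3 seconds, and storing everything as a float16 numpy array. The use of librosa for resampling and the exact padding implementation are assumptions, not the original code.

```python
import numpy as np
import librosa

TARGET_SR = 12_000           # target sampling rate (Hz)
TARGET_LEN = 3 * TARGET_SR   # 3 seconds -> 36,000 samples


def prepare_clip(audio: np.ndarray, sr: int) -> np.ndarray:
    """Resample to 12 kHz and reverse-pad short clips to exactly 36,000 samples."""
    if sr != TARGET_SR:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    if len(audio) < TARGET_LEN:
        # "Reverse padding": extend the end of the clip with its mirrored samples.
        audio = np.pad(audio, (0, TARGET_LEN - len(audio)), mode="symmetric")
    return audio[:TARGET_LEN].astype(np.float16)  # float16 halves the memory footprint


# `clips` is a hypothetical list of (audio, sampling_rate) pairs loaded from rfcx/frugalai.
# dataset = np.stack([prepare_clip(audio, sr) for audio, sr in clips])  # shape (n, 36000)
```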
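
The sketch below mirrors the hand-crafted features fed to XGBoost (275 per clip): mean and standard deviation of 55 MFCCs, mean and standard deviation of a 55-band mel spectrogram, and the standard deviation of the spectrogram's delta coefficients. The hop length and other librosa defaults are assumptions not stated in this card.

```python
import numpy as np
import librosa

SR = 12_000
N_FFT = 1024
N_MELS = 55
N_MFCC = 55


def extract_features(audio: np.ndarray) -> np.ndarray:
    """Return the 275-dimensional feature vector for one 3-second clip."""
    audio = audio.astype(np.float32)

    # 55 MFCCs -> mean and std over time (110 features)
    mfcc = librosa.feature.mfcc(y=audio, sr=SR, n_mfcc=N_MFCC, n_fft=N_FFT)

    # 55-band mel spectrogram -> mean and std over time (110 features)
    mel = librosa.feature.melspectrogram(y=audio, sr=SR, n_fft=N_FFT, n_mels=N_MELS)

    # Std of the delta coefficients (55 features), intended to capture the
    # chainsaw's idle-to-full-load signature.
    delta = librosa.feature.delta(mel)

    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        mel.mean(axis=1), mel.std(axis=1),
        delta.std(axis=1),
    ])
```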
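
For the XGBoost training setup (3000 trees, learning rate 0.02, trained on GPU), a plausible configuration is sketched below; the objective and tree method are assumptions, and `device="cuda"` requires xgboost >= 2.0.

```python
import numpy as np
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=3000,    # 3000 trees
    learning_rate=0.02,
    tree_method="hist",   # assumption
    device="cuda",        # train on GPU
)

# X: (n, 275) matrix from extract_features(); y: 0/1 labels (0: Chainsaw, 1: Environment).
# model.fit(X_train, y_train)
# proba = np.clip(model.predict(X_test), 0.0, 1.0)  # regressor output used as a probability
```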
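
The CNN input and the noise augmentation described above could look like the sketch below: a log mel spectrogram with 100 mel bands and n_fft=1024, plus a random mix-in of an Environment-labelled clip without changing the label. The mixing gain and probability are assumptions.

```python
import numpy as np
import librosa

SR = 12_000
N_FFT = 1024
N_MELS = 100


def log_mel(audio: np.ndarray) -> np.ndarray:
    """Log mel spectrogram used as CNN input."""
    mel = librosa.feature.melspectrogram(
        y=audio.astype(np.float32), sr=SR, n_fft=N_FFT, n_mels=N_MELS
    )
    return librosa.power_to_db(mel)


def add_environment_noise(audio: np.ndarray, env_clip: np.ndarray,
                          gain: float = 0.3, p: float = 0.5) -> np.ndarray:
    """With probability p, mix an Environment-labelled clip into the signal as extra noise."""
    if np.random.rand() < p:
        audio = audio + gain * env_clip[: len(audio)]
    return audio
```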
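
Finally, the ensemble step averages the two probabilities and applies a decision threshold (0.5 for the accuracy reported in the Metrics section). A minimal sketch, assuming the per-clip probabilities are already available as numpy arrays:

```python
import numpy as np


def ensemble_predict(xgb_proba: np.ndarray, cnn_proba: np.ndarray,
                     threshold: float = 0.5) -> np.ndarray:
    """Average the two model probabilities and threshold to a binary label."""
    proba = (xgb_proba + cnn_proba) / 2.0
    return (proba >= threshold).astype(int)


# Example with hypothetical scores for three clips:
# ensemble_predict(np.array([0.1, 0.8, 0.55]), np.array([0.2, 0.9, 0.40]))  # -> [0, 1, 0]
```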