Upload Tiny-AST MAD classifier with 96.73% accuracy - 2025-08-20 11:01
Browse files- README.md +183 -0
- config.json +42 -0
- inference_example.py +59 -0
- model.safetensors +3 -0
- preprocessor_config.json +13 -0
- training_info.json +52 -0
README.md
ADDED
@@ -0,0 +1,183 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
---
|
2 |
+
language: en
|
3 |
+
license: apache-2.0
|
4 |
+
tags:
|
5 |
+
- audio-classification
|
6 |
+
- military-audio
|
7 |
+
- ast
|
8 |
+
- tiny-ast
|
9 |
+
- pytorch
|
10 |
+
- transformers
|
11 |
+
- surveillance
|
12 |
+
- edge-deployment
|
13 |
+
metrics:
|
14 |
+
- accuracy
|
15 |
+
- f1
|
16 |
+
model-index:
|
17 |
+
- name: tiny-ast-mad-military-audio-classifier
|
18 |
+
results:
|
19 |
+
- task:
|
20 |
+
type: audio-classification
|
21 |
+
name: Military Audio Classification
|
22 |
+
dataset:
|
23 |
+
name: MAD Dataset
|
24 |
+
type: military-audio
|
25 |
+
metrics:
|
26 |
+
- type: accuracy
|
27 |
+
value: 0.9673
|
28 |
+
name: Accuracy
|
29 |
+
- type: f1
|
30 |
+
value: 0.9674
|
31 |
+
name: F1-weighted
|
32 |
+
---
|
33 |
+
|
34 |
+
# Tiny-AST Military Audio Classifier
|
35 |
+
|
36 |
+
🎖️ **State-of-the-art military audio classification model** achieving **96.73% accuracy** on the Military Audio Dataset (MAD).
|
37 |
+
|
38 |
+
## Model Description
|
39 |
+
|
40 |
+
This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) on the Military Audio Dataset (MAD). It's designed for **edge deployment** on devices like Raspberry Pi 5 for military surveillance applications.
|
41 |
+
|
42 |
+
### Key Features
|
43 |
+
- 🎯 **96.73% accuracy** on MAD dataset (7 military audio classes)
|
44 |
+
- 🚀 **Edge-optimized** for Raspberry Pi deployment
|
45 |
+
- ⚡ **Fast inference** (<200ms per sample)
|
46 |
+
- 🧠 **Efficient** (16.5% of parameters fine-tuned)
|
47 |
+
- 🔊 **Robust** to real-world military environments
|
48 |
+
|
49 |
+
## Training Results
|
50 |
+
|
51 |
+
### Progressive Training Performance:
|
52 |
+
- **Phase 1** (Classifier only): 94.32% accuracy
|
53 |
+
- **Phase 2** (Top 2 layers): 96.73% accuracy ← **Best Model**
|
54 |
+
- **Phase 3** (Top 4 layers): 96.35% accuracy
|
55 |
+
- **Phase 4** (Top 6 layers): 96.73% accuracy
|
56 |
+
|
57 |
+
### Training Configuration:
|
58 |
+
- **Method**: Progressive unfreezing strategy
|
59 |
+
- **Learning Rates**: Conservative (1e-4 → 2e-5)
|
60 |
+
- **Normalization**: MAD-specific statistics (mean: -2.16, std: 2.85)
|
61 |
+
- **Class Weighting**: Balanced for imbalanced dataset
|
62 |
+
- **Training Time**: 40 minutes on RTX 3060
|
63 |
+
|
64 |
+
## Model Classes
|
65 |
+
|
66 |
+
The model classifies 7 military audio categories:
|
67 |
+
|
68 |
+
| Class ID | Class Name | Training Samples | Test Samples |
|
69 |
+
|----------|------------|------------------|--------------|
|
70 |
+
| 0 | Communication | 774 | 207 |
|
71 |
+
| 1 | Footsteps | 1,293 | 280 |
|
72 |
+
| 2 | Gunshot | 773 | 104 |
|
73 |
+
| 3 | Shelling | 883 | 104 |
|
74 |
+
| 4 | Vehicle | 910 | 122 |
|
75 |
+
| 5 | Helicopter | 934 | 91 |
|
76 |
+
| 6 | Fighter | 862 | 129 |
|
77 |
+
|
78 |
+
## Usage
|
79 |
+
|
80 |
+
### Quick Start
|
81 |
+
```python
|
82 |
+
from transformers import ASTForAudioClassification, ASTFeatureExtractor
|
83 |
+
import librosa
|
84 |
+
import torch
|
85 |
+
|
86 |
+
# Load model and feature extractor
|
87 |
+
model = ASTForAudioClassification.from_pretrained("Akashpaul123/tiny-ast-mad-military-audio-classifier")
|
88 |
+
feature_extractor = ASTFeatureExtractor.from_pretrained("Akashpaul123/tiny-ast-mad-military-audio-classifier")
|
89 |
+
|
90 |
+
# Load audio file (16kHz recommended)
|
91 |
+
audio, sr = librosa.load("military_audio.wav", sr=16000)
|
92 |
+
|
93 |
+
# Extract features
|
94 |
+
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
|
95 |
+
|
96 |
+
# Predict
|
97 |
+
with torch.no_grad():
|
98 |
+
outputs = model(**inputs)
|
99 |
+
predicted_class = torch.argmax(outputs.logits, dim=-1).item()
|
100 |
+
|
101 |
+
# Class mapping
|
102 |
+
classes = ['Communication', 'Footsteps', 'Gunshot', 'Shelling', 'Vehicle', 'Helicopter', 'Fighter']
|
103 |
+
print(f"Predicted class: {classes[predicted_class]}")
|
104 |
+
```
|
105 |
+
|
106 |
+
### Edge Deployment (Raspberry Pi 5)
|
107 |
+
```python
|
108 |
+
import onnxruntime as ort
|
109 |
+
|
110 |
+
# Load ONNX model for edge inference
|
111 |
+
session = ort.InferenceSession("tiny_ast_mad_optimized.onnx")
|
112 |
+
# ... inference code
|
113 |
+
```
|
114 |
+
|
115 |
+
## Training Details
|
116 |
+
|
117 |
+
### Dataset
|
118 |
+
- **Source**: Military Audio Dataset (MAD)
|
119 |
+
- **Total Samples**: 7,466 audio files
|
120 |
+
- **Duration**: 2-8 seconds per sample
|
121 |
+
- **Sample Rate**: 16kHz
|
122 |
+
- **Augmentation**: Military-specific (time stretch, pitch shift, noise injection)
|
123 |
+
|
124 |
+
### Architecture
|
125 |
+
- **Base Model**: Audio Spectrogram Transformer (AST)
|
126 |
+
- **Parameters**: 86.2M total, 14.2M trainable (16.5%)
|
127 |
+
- **Input**: Log-Mel spectrograms (1024 x 128)
|
128 |
+
- **Output**: 7 military audio classes
|
129 |
+
|
130 |
+
### Performance Metrics
|
131 |
+
- **Accuracy**: 96.73%
|
132 |
+
- **F1-Macro**: 96.84%
|
133 |
+
- **F1-Weighted**: 96.74%
|
134 |
+
- **Precision**: High across all classes
|
135 |
+
- **Recall**: Balanced performance
|
136 |
+
|
137 |
+
## Hardware Requirements
|
138 |
+
|
139 |
+
### Training
|
140 |
+
- **GPU**: RTX 3060 (12GB VRAM) or similar
|
141 |
+
- **RAM**: 16GB+ recommended
|
142 |
+
- **Storage**: 50GB for dataset and models
|
143 |
+
|
144 |
+
### Inference (Edge)
|
145 |
+
- **Device**: Raspberry Pi 5 or similar ARM device
|
146 |
+
- **RAM**: 2GB minimum
|
147 |
+
- **Inference Time**: <200ms per sample
|
148 |
+
- **Power**: <5W continuous operation
|
149 |
+
|
150 |
+
## Limitations and Considerations
|
151 |
+
|
152 |
+
- **Domain-specific**: Optimized for military audio contexts
|
153 |
+
- **Language**: Primarily English communication samples
|
154 |
+
- **Environment**: Trained on MAD dataset conditions
|
155 |
+
- **Real-time**: Designed for batch processing, not streaming
|
156 |
+
|
157 |
+
## Citation
|
158 |
+
|
159 |
+
If you use this model in your research, please cite:
|
160 |
+
|
161 |
+
```bibtex
|
162 |
+
@misc{tiny-ast-mad-2024,
|
163 |
+
title={Tiny-AST Military Audio Classifier: Progressive Fine-tuning for Edge Deployment},
|
164 |
+
author={Paul, Akash},
|
165 |
+
year={2024},
|
166 |
+
howpublished={Hugging Face Model Hub},
|
167 |
+
url={https://huggingface.co/Akashpaul123/tiny-ast-mad-military-audio-classifier}
|
168 |
+
}
|
169 |
+
```
|
170 |
+
|
171 |
+
## License
|
172 |
+
|
173 |
+
This model is licensed under the Apache 2.0 License.
|
174 |
+
|
175 |
+
## Contact
|
176 |
+
|
177 |
+
- **Author**: Akash Paul
|
178 |
+
- **GitHub**: [@akashpaul123](https://github.com/akashpaul123)
|
179 |
+
- **Hugging Face**: [@akashpaul123](https://huggingface.co/akashpaul123)
|
180 |
+
|
181 |
+
---
|
182 |
+
|
183 |
+
*Model trained as part of military audio surveillance research with focus on edge deployment and real-world robustness.*
|
config.json
ADDED
@@ -0,0 +1,42 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"architectures": [
|
3 |
+
"ASTForAudioClassification"
|
4 |
+
],
|
5 |
+
"attention_probs_dropout_prob": 0.0,
|
6 |
+
"frequency_stride": 10,
|
7 |
+
"hidden_act": "gelu",
|
8 |
+
"hidden_dropout_prob": 0.0,
|
9 |
+
"hidden_size": 768,
|
10 |
+
"id2label": {
|
11 |
+
"0": "LABEL_0",
|
12 |
+
"1": "LABEL_1",
|
13 |
+
"2": "LABEL_2",
|
14 |
+
"3": "LABEL_3",
|
15 |
+
"4": "LABEL_4",
|
16 |
+
"5": "LABEL_5",
|
17 |
+
"6": "LABEL_6"
|
18 |
+
},
|
19 |
+
"initializer_range": 0.02,
|
20 |
+
"intermediate_size": 3072,
|
21 |
+
"label2id": {
|
22 |
+
"LABEL_0": 0,
|
23 |
+
"LABEL_1": 1,
|
24 |
+
"LABEL_2": 2,
|
25 |
+
"LABEL_3": 3,
|
26 |
+
"LABEL_4": 4,
|
27 |
+
"LABEL_5": 5,
|
28 |
+
"LABEL_6": 6
|
29 |
+
},
|
30 |
+
"layer_norm_eps": 1e-12,
|
31 |
+
"max_length": 1024,
|
32 |
+
"model_type": "audio-spectrogram-transformer",
|
33 |
+
"num_attention_heads": 12,
|
34 |
+
"num_hidden_layers": 12,
|
35 |
+
"num_mel_bins": 128,
|
36 |
+
"patch_size": 16,
|
37 |
+
"problem_type": "single_label_classification",
|
38 |
+
"qkv_bias": true,
|
39 |
+
"time_stride": 10,
|
40 |
+
"torch_dtype": "float32",
|
41 |
+
"transformers_version": "4.55.2"
|
42 |
+
}
|
inference_example.py
ADDED
@@ -0,0 +1,59 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
"""
|
2 |
+
Example inference script for Tiny-AST MAD Military Audio Classifier
|
3 |
+
"""
|
4 |
+
|
5 |
+
from transformers import ASTForAudioClassification, ASTFeatureExtractor
|
6 |
+
import librosa
|
7 |
+
import torch
|
8 |
+
import numpy as np
|
9 |
+
|
10 |
+
def classify_military_audio(audio_path, model_name="akashpaul123/tiny-ast-mad-military-audio-classifier"):
|
11 |
+
"""
|
12 |
+
Classify military audio using the fine-tuned Tiny-AST model
|
13 |
+
|
14 |
+
Args:
|
15 |
+
audio_path (str): Path to audio file
|
16 |
+
model_name (str): Hugging Face model name
|
17 |
+
|
18 |
+
Returns:
|
19 |
+
dict: Classification results
|
20 |
+
"""
|
21 |
+
|
22 |
+
# Load model and feature extractor
|
23 |
+
model = ASTForAudioClassification.from_pretrained(model_name)
|
24 |
+
feature_extractor = ASTFeatureExtractor.from_pretrained(model_name)
|
25 |
+
|
26 |
+
# Load and preprocess audio
|
27 |
+
audio, sr = librosa.load(audio_path, sr=16000, duration=10.0)
|
28 |
+
|
29 |
+
# Extract features
|
30 |
+
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
|
31 |
+
|
32 |
+
# Predict
|
33 |
+
with torch.no_grad():
|
34 |
+
outputs = model(**inputs)
|
35 |
+
probabilities = torch.softmax(outputs.logits, dim=-1)
|
36 |
+
predicted_class = torch.argmax(probabilities, dim=-1).item()
|
37 |
+
confidence = probabilities[0][predicted_class].item()
|
38 |
+
|
39 |
+
# Class mapping
|
40 |
+
classes = ['Communication', 'Footsteps', 'Gunshot', 'Shelling',
|
41 |
+
'Vehicle', 'Helicopter', 'Fighter']
|
42 |
+
|
43 |
+
return {
|
44 |
+
'predicted_class': classes[predicted_class],
|
45 |
+
'class_id': predicted_class,
|
46 |
+
'confidence': confidence,
|
47 |
+
'all_probabilities': {cls: prob.item() for cls, prob in zip(classes, probabilities[0])}
|
48 |
+
}
|
49 |
+
|
50 |
+
# Example usage
|
51 |
+
if __name__ == "__main__":
|
52 |
+
# Replace with your audio file path
|
53 |
+
result = classify_military_audio("path/to/your/military_audio.wav")
|
54 |
+
|
55 |
+
print(f"Predicted class: {result['predicted_class']}")
|
56 |
+
print(f"Confidence: {result['confidence']:.4f}")
|
57 |
+
print("\nAll class probabilities:")
|
58 |
+
for class_name, prob in result['all_probabilities'].items():
|
59 |
+
print(f" {class_name}: {prob:.4f}")
|
model.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:938ba3a3d129bf148bfab506bd8284a1d79b819008e17d1cb3c862836fb109b1
|
3 |
+
size 344805420
|
preprocessor_config.json
ADDED
@@ -0,0 +1,13 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"do_normalize": true,
|
3 |
+
"feature_extractor_type": "ASTFeatureExtractor",
|
4 |
+
"feature_size": 1,
|
5 |
+
"max_length": 1024,
|
6 |
+
"mean": -2.164904,
|
7 |
+
"num_mel_bins": 128,
|
8 |
+
"padding_side": "right",
|
9 |
+
"padding_value": 0.0,
|
10 |
+
"return_attention_mask": false,
|
11 |
+
"sampling_rate": 16000,
|
12 |
+
"std": 2.854887
|
13 |
+
}
|
training_info.json
ADDED
@@ -0,0 +1,52 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"model_name": "tiny-ast-mad-military-audio-classifier",
|
3 |
+
"base_model": "MIT/ast-finetuned-audioset-10-10-0.4593",
|
4 |
+
"dataset": "Military Audio Dataset (MAD)",
|
5 |
+
"training_method": "Progressive Unfreezing",
|
6 |
+
"best_phase": 2,
|
7 |
+
"final_accuracy": 0.9673,
|
8 |
+
"final_f1_weighted": 0.9674,
|
9 |
+
"training_time_minutes": 40.1,
|
10 |
+
"classes": [
|
11 |
+
"Communication",
|
12 |
+
"Footsteps",
|
13 |
+
"Gunshot",
|
14 |
+
"Shelling",
|
15 |
+
"Vehicle",
|
16 |
+
"Helicopter",
|
17 |
+
"Fighter"
|
18 |
+
],
|
19 |
+
"class_mapping": {
|
20 |
+
"0": "Communication",
|
21 |
+
"1": "Footsteps",
|
22 |
+
"2": "Gunshot",
|
23 |
+
"3": "Shelling",
|
24 |
+
"4": "Vehicle",
|
25 |
+
"5": "Helicopter",
|
26 |
+
"6": "Fighter"
|
27 |
+
},
|
28 |
+
"normalization_stats": {
|
29 |
+
"mean": -2.164904,
|
30 |
+
"std": 2.854887
|
31 |
+
},
|
32 |
+
"phase_results": {
|
33 |
+
"phase_1": {
|
34 |
+
"accuracy": 0.9432,
|
35 |
+
"f1_weighted": 0.9432
|
36 |
+
},
|
37 |
+
"phase_2": {
|
38 |
+
"accuracy": 0.9673,
|
39 |
+
"f1_weighted": 0.9674
|
40 |
+
},
|
41 |
+
"phase_3": {
|
42 |
+
"accuracy": 0.9635,
|
43 |
+
"f1_weighted": 0.9635
|
44 |
+
},
|
45 |
+
"phase_4": {
|
46 |
+
"accuracy": 0.9673,
|
47 |
+
"f1_weighted": 0.9674
|
48 |
+
}
|
49 |
+
},
|
50 |
+
"upload_date": "2025-08-20T11:01:50.672302",
|
51 |
+
"hardware_used": "RTX 3060 (12GB VRAM)"
|
52 |
+
}
|