---
license: apache-2.0
tags:
- generated_from_trainer
metrics:
- accuracy
model_index:
  name: wav2vec-english-speech-emotion-recognition
---

# Speech Emotion Recognition By Fine-Tuning Wav2Vec 2.0
The model is a fine-tuned version of [jonatasgrosman/wav2vec2-large-xlsr-53-english](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english) for a Speech Emotion Recognition (SER) task.

Several datasets were used to fine-tune the original model:
- Surrey Audio-Visual Expressed Emotion [(SAVEE)](http://kahlan.eps.surrey.ac.uk/savee/Database.html) - 480 audio files from 4 male actors
- Ryerson Audio-Visual Database of Emotional Speech and Song [(RAVDESS)](https://zenodo.org/record/1188976) - 1440 audio files from 24 professional actors (12 female, 12 male)
- Toronto emotional speech set [(TESS)](https://tspace.library.utoronto.ca/handle/1807/24487) - 2800 audio files from 2 female actors

7 labels/emotions were used as classification labels:
```python
emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
```
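The integer class indices the model predicts are mapped back to these label names through the configuration stored with the model on the Hub. A quick way to inspect that mapping (not part of the original card, shown only as a convenience; the exact ordering comes from the hosted config) is:

```python
from transformers import AutoConfig

# Inspect the index-to-label mapping stored in the hosted model config
config = AutoConfig.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
print(config.id2label)  # e.g. {0: 'angry', 1: 'disgust', ...} -- illustrative, check the repo
```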
It achieves the following results on the evaluation set:
- Loss: 0.104075
- Accuracy: 0.97463

## Model Usage
```bash
pip install transformers librosa torch
```
```python
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

# Load the fine-tuned model and its feature extractor from the Hugging Face Hub
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")

def predict_emotion(audio_path):
    # wav2vec 2.0 expects 16 kHz mono audio; librosa resamples on load
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(inputs.input_values)
    predictions = torch.nn.functional.softmax(outputs.logits.mean(dim=1), dim=-1)  # average over sequence length
    predicted_label = torch.argmax(predictions, dim=-1)
    emotion = model.config.id2label[predicted_label.item()]
    return emotion

emotion = predict_emotion("example_audio.wav")
print(f"Predicted emotion: {emotion}")
# >> Predicted emotion: angry
```
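Note that `librosa.load(..., sr=16000)` resamples the input to the 16 kHz audio wav2vec 2.0 expects, and the logits are averaged over the time dimension so a single utterance-level label is produced regardless of clip length.

If the full probability distribution is more useful than just the top label, a small extension of the snippet above could look like the following sketch (it reuses `model`, `feature_extractor`, `librosa`, and `torch` from the block above; the function name is illustrative and not part of the original card):

```python
def predict_emotion_probabilities(audio_path):
    # Same preprocessing as predict_emotion above
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits.mean(dim=1)  # average over time steps
    probs = torch.nn.functional.softmax(logits, dim=-1).squeeze(0)
    # Map each probability back to its label name
    return {model.config.id2label[i]: round(float(p), 4) for i, p in enumerate(probs)}

print(predict_emotion_probabilities("example_audio.wav"))
```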

## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (a sketch of how they map onto `TrainingArguments` follows the list):
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- eval_steps: 500
- seed: 42
- gradient_accumulation_steps: 2
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- num_epochs: 4
- max_steps: 7500
- save_steps: 1500
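These values correspond directly to `transformers.TrainingArguments`; the block below is a minimal sketch of that mapping, assuming the standard `Trainer` API (the `output_dir` is illustrative, and the original training script is not part of this card):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec-english-speech-emotion-recognition",  # illustrative
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=4,
    max_steps=7500,                  # takes precedence over num_train_epochs when set
    evaluation_strategy="steps",     # `eval_strategy` in newer transformers versions
    eval_steps=500,
    save_steps=1500,
    seed=42,
    adam_beta1=0.9,                  # Adam betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```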

### Training results
| Step | Training Loss | Validation Loss | Accuracy |
| ---- | ------------- | --------------- | -------- |
| 500  | 1.8124        | 1.365212        | 0.486258 |
| 1000 | 0.8872        | 0.773145        | 0.79704  |
| 1500 | 0.7035        | 0.574954        | 0.852008 |
| 2000 | 0.6879        | 1.286738        | 0.775899 |
| 2500 | 0.6498        | 0.697455        | 0.832981 |
| 3000 | 0.5696        | 0.33724         | 0.892178 |
| 3500 | 0.4218        | 0.307072        | 0.911205 |
| 4000 | 0.3088        | 0.374443        | 0.930233 |
| 4500 | 0.2688        | 0.260444        | 0.936575 |
| 5000 | 0.2973        | 0.302985        | 0.92389  |
| 5500 | 0.1765        | 0.165439        | 0.961945 |
| 6000 | 0.1475        | 0.170199        | 0.961945 |
| 6500 | 0.1274        | 0.15531         | 0.966173 |
| 7000 | 0.0699        | 0.103882        | 0.976744 |
| 7500 | 0.083         | 0.104075        | 0.97463  |