---
license: apache-2.0
tags:
- generated_from_trainer
metrics:
- accuracy
model_index:
  name: wav2vec-english-speech-emotion-recognition
---

# Speech Emotion Recognition By Fine-Tuning Wav2Vec 2.0
The model is a fine-tuned version of [jonatasgrosman/wav2vec2-large-xlsr-53-english](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english) for a Speech Emotion Recognition (SER) task.

Several datasets were used to fine-tune the original model:
- Surrey Audio-Visual Expressed Emotion [(SAVEE)](http://kahlan.eps.surrey.ac.uk/savee/Database.html) - 480 audio files from 4 male actors
- Ryerson Audio-Visual Database of Emotional Speech and Song [(RAVDESS)](https://zenodo.org/record/1188976) - 1440 audio files from 24 professional actors (12 female, 12 male)
- Toronto emotional speech set [(TESS)](https://tspace.library.utoronto.ca/handle/1807/24487) - 2800 audio files from 2 female actors

7 labels/emotions were used as classification labels:
```python
emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
```
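The integer class indices the model predicts are mapped back to these label names through the configuration stored with the model on the Hub. A quick way to inspect that mapping (not part of the original card, shown only as a convenience; the exact ordering comes from the hosted config) is:

```python
from transformers import AutoConfig

# Inspect the index-to-label mapping stored in the hosted model config
config = AutoConfig.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
print(config.id2label)  # e.g. {0: 'angry', 1: 'disgust', ...} -- illustrative, check the repo
```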
It achieves the following results on the evaluation set:
- Loss: 0.104075
- Accuracy: 0.97463

## Model Usage
```bash
pip install transformers librosa torch
```
```python
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

# Load the fine-tuned model and its feature extractor from the Hugging Face Hub
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")

def predict_emotion(audio_path):
    # wav2vec 2.0 expects 16 kHz mono audio; librosa resamples on load
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(inputs.input_values)
    predictions = torch.nn.functional.softmax(outputs.logits.mean(dim=1), dim=-1)  # average over sequence length
    predicted_label = torch.argmax(predictions, dim=-1)
    emotion = model.config.id2label[predicted_label.item()]
    return emotion

emotion = predict_emotion("example_audio.wav")
print(f"Predicted emotion: {emotion}")
# >> Predicted emotion: angry
```
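Note that `librosa.load(..., sr=16000)` resamples the input to the 16 kHz audio wav2vec 2.0 expects, and the logits are averaged over the time dimension so a single utterance-level label is produced regardless of clip length.

If the full probability distribution is more useful than just the top label, a small extension of the snippet above could look like the following sketch (it reuses `model`, `feature_extractor`, `librosa`, and `torch` from the block above; the function name is illustrative and not part of the original card):

```python
def predict_emotion_probabilities(audio_path):
    # Same preprocessing as predict_emotion above
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits.mean(dim=1)  # average over time steps
    probs = torch.nn.functional.softmax(logits, dim=-1).squeeze(0)
    # Map each probability back to its label name
    return {model.config.id2label[i]: round(float(p), 4) for i, p in enumerate(probs)}

print(predict_emotion_probabilities("example_audio.wav"))
```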

## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (a sketch of how they map onto `TrainingArguments` follows the list):
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- eval_steps: 500
- seed: 42
- gradient_accumulation_steps: 2
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- num_epochs: 4
- max_steps: 7500
- save_steps: 1500
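These values correspond directly to `transformers.TrainingArguments`; the block below is a minimal sketch of that mapping, assuming the standard `Trainer` API (the `output_dir` is illustrative, and the original training script is not part of this card):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec-english-speech-emotion-recognition",  # illustrative
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=4,
    max_steps=7500,                  # takes precedence over num_train_epochs when set
    evaluation_strategy="steps",     # `eval_strategy` in newer transformers versions
    eval_steps=500,
    save_steps=1500,
    seed=42,
    adam_beta1=0.9,                  # Adam betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```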

### Training results
| Step | Training Loss | Validation Loss | Accuracy |
| ---- | ------------- | --------------- | -------- |
| 500  | 1.8124        | 1.365212        | 0.486258 |
| 1000 | 0.8872        | 0.773145        | 0.79704  |
| 1500 | 0.7035        | 0.574954        | 0.852008 |
| 2000 | 0.6879        | 1.286738        | 0.775899 |
| 2500 | 0.6498        | 0.697455        | 0.832981 |
| 3000 | 0.5696        | 0.33724         | 0.892178 |
| 3500 | 0.4218        | 0.307072        | 0.911205 |
| 4000 | 0.3088        | 0.374443        | 0.930233 |
| 4500 | 0.2688        | 0.260444        | 0.936575 |
| 5000 | 0.2973        | 0.302985        | 0.92389  |
| 5500 | 0.1765        | 0.165439        | 0.961945 |
| 6000 | 0.1475        | 0.170199        | 0.961945 |
| 6500 | 0.1274        | 0.15531         | 0.966173 |
| 7000 | 0.0699        | 0.103882        | 0.976744 |
| 7500 | 0.083         | 0.104075        | 0.97463  |