Susurro: Spanish Speech Recognition Model

Model Description

Susurro is a fine-tuned version of OpenAI's Whisper model, optimized for Spanish speech recognition. It was fine-tuned on Spanish speech data to improve transcription accuracy on Spanish audio.

Training Data

The model was trained on a Spanish speech dataset consisting of:

  • Training set: Spanish speech audio samples
  • Test set: Separate held-out audio samples used for evaluation
  • Audio sampling rate: 16 kHz
  • Language: Spanish
  • Task: Speech transcription
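Because the data is specified at 16 kHz, it can help to check a file's sample rate before feeding it to the model. The following is a minimal sketch using only the Python standard library. It assumes the audio is stored as an uncompressed WAV file, and the helper names `wav_sample_rate` and `needs_resampling` are hypothetical, not part of this model's tooling.

```python
import wave

EXPECTED_RATE = 16_000  # 16 kHz, as stated in the list above


def wav_sample_rate(path):
    """Return the sample rate (Hz) recorded in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns True, resample the audio to 16 kHz with the audio library of your choice before transcription.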

Training Procedure

The model was trained using the following configuration:

  • Base model: openai/whisper-large-v3-turbo
  • Training type: Fine-tuning
  • Batch size: 2 per device
  • Gradient accumulation steps: 16
  • Learning rate: 1e-5
  • Warmup steps: 500
  • Max steps: 8000
  • Training optimizations:
    • Gradient checkpointing enabled
    • FP16 training
    • 8-bit Adam optimizer
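The sketch below works through what the configuration above implies. The numeric values are copied from the list. The keyword names are those of `Seq2SeqTrainingArguments` in Transformers, but mapping "8-bit Adam" to `optim="adamw_bnb_8bit"` is an assumption, as is the linear warmup and decay schedule (the Trainer default, which the model card does not confirm). The original training script is not published here, so treat this as an illustration rather than the author's setup.

```python
# Values copied from the configuration list above.
PER_DEVICE_BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 16
LEARNING_RATE = 1e-5
WARMUP_STEPS = 500
MAX_STEPS = 8000

# Assumed mapping onto Seq2SeqTrainingArguments keyword names.
# "adamw_bnb_8bit" is a guess at what "8-bit Adam optimizer" refers to.
training_kwargs = {
    "per_device_train_batch_size": PER_DEVICE_BATCH_SIZE,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "learning_rate": LEARNING_RATE,
    "warmup_steps": WARMUP_STEPS,
    "max_steps": MAX_STEPS,
    "gradient_checkpointing": True,
    "fp16": True,
    "optim": "adamw_bnb_8bit",
}


def effective_batch_size(num_devices=1):
    """Number of samples that contribute to each optimizer step."""
    return PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS * num_devices


def linear_schedule_lr(step):
    """Learning rate at `step`, assuming linear warmup then linear decay."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, MAX_STEPS - step)
    return LEARNING_RATE * remaining / (MAX_STEPS - WARMUP_STEPS)
```

On a single device, each optimizer step sees 2 × 16 = 32 samples, so 8000 steps cover 256,000 samples in total. The dictionary could be passed as `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)`, which requires a GPU and the `bitsandbytes` package for the assumed optimizer.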

Intended Uses

This model is designed for:

  • Spanish speech recognition
  • Audio transcription in Spanish
  • Real-time speech-to-text applications
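For streaming or long-form use, audio is usually buffered into windows before transcription, since Whisper models operate on inputs of up to 30 seconds. The sketch below shows one simple way to split a sample buffer into consecutive windows. The helper `split_into_windows` is hypothetical, and the non-overlapping split is an assumption made for brevity: real-time systems often overlap windows or cut on silence to avoid splitting words.

```python
SAMPLE_RATE = 16_000   # Hz, the rate this model expects
WINDOW_SECONDS = 30    # Whisper's input window length


def split_into_windows(samples, window_seconds=WINDOW_SECONDS,
                       sample_rate=SAMPLE_RATE):
    """Split a sequence of audio samples into consecutive windows.

    Every window has window_seconds * sample_rate samples, except
    possibly the last one, which holds whatever remains.
    """
    window_size = window_seconds * sample_rate
    return [samples[start:start + window_size]
            for start in range(0, len(samples), window_size)]
```

Each window can then be passed through the processor and `model.generate` in turn, and the resulting texts joined.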

How to Use

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa  # assumption: any loader that returns a 16 kHz float waveform works
import torch

# Load model and processor
model_id = "IsmaelRR/SusurroModel-WhisperTurboV3Spanish"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load your audio file as a mono waveform, resampled to 16 kHz
# ("audio.wav" is a placeholder path)
waveform, _ = librosa.load("audio.wav", sr=16000)

# Convert the waveform into the input features the model expects
input_features = processor(
    waveform,
    sampling_rate=16000,
    return_tensors="pt"
).input_features.to(device)

# Generate the transcription; batch_decode returns one string per input
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Limitations

  • The model is trained specifically for Spanish and may perform poorly on other languages
  • Audio input should be sampled at 16 kHz, the rate used for the training data
  • Performance may vary with audio quality and speaker accent
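To measure how much performance varies on your own recordings, you can compare the model's output against a reference transcript. This model card does not report a metric, so the sketch below uses word error rate (WER), a standard choice for speech recognition. The function `word_error_rate` is a hypothetical helper written for illustration, and its normalization (lowercasing and whitespace splitting only) is a simplifying assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping one row at a time.
    # previous[j] is the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing
            insertion = current[j - 1] + 1   # extra hypothesis word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

Computing this separately for groups of recordings (for example, by accent or recording condition) shows where the model is weakest. Punctuation and accent marks count as differences here, so normalize both texts the same way before comparing.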

Training Infrastructure

  • Training framework: 🤗 Transformers
  • Python version: 3.8+
  • Key dependencies:
    • transformers
    • torch
    • datasets
    • numpy
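Before running the model, you may want to confirm that the dependencies listed above are installed. The sketch below uses `importlib.metadata`, which is part of the standard library from Python 3.8. The helper `installed_versions` is hypothetical, and since the model card does not pin versions, none are checked here.

```python
from importlib import metadata

# Package names copied from the dependency list above.
KEY_DEPENDENCIES = ("transformers", "torch", "datasets", "numpy")


def installed_versions(packages=KEY_DEPENDENCIES):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Any package mapped to `None` is not installed in the current environment.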

Citation

If you use this model in your research, please cite:

@misc{susurro2024,
  author = {Your Name},
  title = {Susurro: Fine-tuned Whisper Model for Spanish Speech Recognition},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/IsmaelRR/SusurroModel-WhisperTurboV3Spanish}}
}

License

MIT

Acknowledgements

This model builds upon the OpenAI Whisper model and was trained using the Hugging Face Transformers library. Special thanks to the open-source community and contributors.
