---
license: mit
language:
  - en
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
  - medical
  - speech
  - asr
base_model:
  - openai/whisper-large-v3
  - google/medgemma-4b-it
---

EkaCare Parrotlet-a-en-5b

This is a purpose-built automatic speech recognition (ASR) model trained for English speech in Indian healthcare settings and optimised for transcribing medical speech. It combines a Whisper large-v3 encoder with a MedGemma 4B decoder through a lean projector layer for efficient speech-to-text conversion.

A detailed description of this model is available in this blog post.
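
For intuition, below is a minimal, purely illustrative sketch of this encoder-projector-decoder pattern. It is not the model's actual implementation: the class name, hidden sizes, and wiring are assumptions made only for the example.

import torch
import torch.nn as nn

class SpeechToTextBridge(nn.Module):
    """Illustrative sketch only: audio encoder -> linear projector -> LM decoder."""

    def __init__(self, encoder, decoder, encoder_dim=1280, decoder_dim=2560):
        super().__init__()
        self.encoder = encoder                                # e.g. a Whisper large-v3 encoder
        self.projector = nn.Linear(encoder_dim, decoder_dim)  # the "lean projector layer"
        self.decoder = decoder                                # e.g. a MedGemma 4B causal LM

    def forward(self, input_features, decoder_input_ids):
        # Encode log-mel audio features into acoustic hidden states.
        audio_states = self.encoder(input_features).last_hidden_state
        # Project the acoustic states into the decoder's embedding space.
        audio_prefix = self.projector(audio_states)
        # Prepend the projected audio "tokens" to the text embeddings and decode.
        text_embeds = self.decoder.get_input_embeddings()(decoder_input_ids)
        inputs_embeds = torch.cat([audio_prefix, text_embeds], dim=1)
        return self.decoder(inputs_embeds=inputs_embeds)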

Installation Requirements

To use this model, you need to install the following dependencies:

Python 3.10 and the following packages, installed with pip:

pip install "torchaudio>=2.7.0" "torch>=2.7.0" "transformers>=4.52.0" librosa soundfile python-dotenv huggingface_hub
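
A quick, optional way to confirm that the installed versions meet these minimums (not part of the official instructions):

import torch, torchaudio, transformers, librosa, soundfile

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("transformers:", transformers.__version__)
print("librosa:", librosa.__version__)
print("soundfile:", soundfile.__version__)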

Below is an example of how to load and use this model for automatic speech recognition using the Hugging Face Transformers library.

Loading the model from Hugging Face Hub

from transformers import AutoModel
import librosa


repo_name = "ekacare/parrotlet-a-en-5b"
# trust_remote_code is required because the repository ships custom modeling code
model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
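
If a GPU is available, you will likely want to move the model onto it and switch to inference mode. The snippet below assumes the custom model behaves like a standard PyTorch nn.Module, which is an assumption rather than something stated in this card:

import torch

# Assumption: standard nn.Module semantics (to/eval) apply to the custom model.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()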

Load an audio file

audio_path = "path/to/your/audio.mp3"
audio, sample_rate = librosa.load(audio_path, sr=16000)  # Resample to 16kHz if needed
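
Since torchaudio is also in the dependency list, an equivalent load-and-resample step could be written with it instead of librosa; this is an alternative sketch, not the model card's official example:

import torchaudio

waveform, sr = torchaudio.load(audio_path)   # shape: (channels, samples)
waveform = waveform.mean(dim=0)              # downmix to mono
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
audio = waveform.numpy()
sample_rate = 16000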

Perform speech recognition

transcription = model.transcribe(audio, sample_rate)
print("Transcription:", transcription)

Notes

  • Ensure the audio input is in a compatible format (WAV or MP3) with a 16 kHz sampling rate for optimal performance.
  • This model handles short-form audio; keep each chunk shorter than 30 seconds (see the chunking sketch below for longer recordings).
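
For recordings longer than 30 seconds, one simple workaround is to split the waveform into fixed windows under the limit and transcribe each piece separately. Naive splitting can cut words at window boundaries, so treat this as a rough sketch rather than a recommended pipeline:

import librosa

def transcribe_long_audio(model, path, sr=16000, chunk_seconds=25):
    # Split the waveform into fixed windows safely under the 30-second limit.
    audio, _ = librosa.load(path, sr=sr)
    window = chunk_seconds * sr
    pieces = []
    for start in range(0, len(audio), window):
        chunk = audio[start:start + window]
        pieces.append(model.transcribe(chunk, sr))
    return " ".join(pieces)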

Authentication (if required)

If the repository requires authentication, log in to your Hugging Face account, generate an access token under Hugging Face Settings (https://huggingface.co/settings/tokens), and set the token in your environment:

export HF_TOKEN="your-access-token"
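
Because python-dotenv is in the dependency list, one option is to keep the token in a local .env file and pass it explicitly through the token argument of from_pretrained; the .env workflow here is a suggestion, not a requirement of the model card:

import os

from dotenv import load_dotenv
from transformers import AutoModel

load_dotenv()  # reads HF_TOKEN from a local .env file, if one exists
model = AutoModel.from_pretrained(
    "ekacare/parrotlet-a-en-5b",
    trust_remote_code=True,
    token=os.getenv("HF_TOKEN"),
)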

Alternatively, use the Hugging Face CLI to log in:

huggingface-cli login

License

This model is released under the MIT License, which permits broad use provided the copyright and license notice are retained.