reward-model
This model is a Reward Model trained on the RobotsMali transcription scorer dataset. It achieves the following results on the evaluation set:
- Loss: 0.0609
- R2: 0.5447
- Pearson: 0.7406
Model description
The scores in the RobotsMali transcription scorer dataset were assigned by human annotators. Given an (audio, text) pair, the model predicts a continuous score between 0 and 1 that represents how well the text matches the spoken audio.
The model can be integrated as a Reward Model within RLHF pipelines to evaluate or fine-tune ASR models based on human preference scores.
Intended uses & limitations
Intended uses
- Evaluate the quality of an ASR transcription against audio, producing a continuous score in [0,1].
- Integrate as a Reward Model in RLHF (Reinforcement Learning from Human Feedback) pipelines for fine-tuning ASR models.
- Automatically compare transcriptions generated by different ASR systems or models.
- Serve as a reference-free proxy metric for ASR, allowing approximate quality evaluation without requiring reference transcriptions.
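As a rough illustration of the comparison use case, the sketch below ranks candidate transcriptions of one audio file by score. The helper `rank_transcriptions` and the stand-in scorer `dummy_score` are hypothetical names that are not part of the package; in practice the scoring function would wrap a call to the reward model.

```python
from typing import Callable, List, Sequence, Tuple


def rank_transcriptions(
    audio_path: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs sorted from best to worst score."""
    scored = [(text, score_fn(audio_path, text)) for text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def dummy_score(audio_path: str, text: str) -> float:
    """Stand-in scorer with made-up values, for demonstration only."""
    return {"kelen": 0.91, "kelem": 0.40}.get(text, 0.0)


ranking = rank_transcriptions("1.wav", ["kelem", "kelen"], dummy_score)
best_text, best_score = ranking[0]
print(best_text, best_score)
```

Because the scorer is passed in as a callable, the same helper works for outputs from any number of ASR systems.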
Limitations
- Sensitive to accents, background noise, or pronunciation variations not represented in the RobotsMali dataset.
- Scores follow rules defined by our team rather than purely subjective judgment, so they reflect the specific scoring criteria established for this dataset.
Training Procedure
Audio Encoder
Input: Raw waveform (16 kHz)
Feature extraction: Mel-spectrogram using the processor of RobotsMali's STT-BM-QuartzNet15x5-V0 model
Architecture:
- 1D Convolutional layers: audio_conv_layers × (Conv1d → BatchNorm1d → ReLU)
- Channels: audio_conv_channels (input channels = 64, kernel size = kernel_size, stride = stride, padding = padding)
- Adaptive Average Pooling over time → output dimension = audio_conv_channels
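The audio branch described above could be sketched in PyTorch roughly as follows. This is a minimal re-creation under assumptions, not the package's actual code: the class name `AudioEncoderSketch` is hypothetical, the 64 input channels are assumed to be mel bins, and the default values are taken from the parameter table in this card.

```python
import torch
import torch.nn as nn


class AudioEncoderSketch(nn.Module):
    """Hypothetical re-creation of the audio branch (not the package's class)."""

    def __init__(self, in_channels=64, channels=128, num_layers=3,
                 kernel_size=5, stride=1, padding=2):
        super().__init__()
        layers = []
        for i in range(num_layers):
            # Only the first block sees the mel channels; later blocks keep `channels`.
            layers += [
                nn.Conv1d(in_channels if i == 0 else channels, channels,
                          kernel_size, stride=stride, padding=padding),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            ]
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)  # average over the time axis

    def forward(self, mel):
        # mel: (batch, in_channels, time)
        x = self.conv(mel)
        return self.pool(x).squeeze(-1)  # (batch, channels)


encoder = AudioEncoderSketch().eval()
audio_emb = encoder(torch.randn(2, 64, 200))  # random stand-in for mel features
print(audio_emb.shape)
```

With kernel size 5, stride 1, and padding 2, each convolution preserves the time length, so the pooling step alone collapses the sequence to a fixed-size vector.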
Text Encoder
Input: Tokenized transcription (IDs from SentencePiece tokenizer)
Architecture:
- Embedding layer: embed_dim (vocab_size = vocab_size, padding_idx = pad_token_id)
- Bidirectional LSTM: hidden size = lstm_hidden, layers = lstm_layers
- Sequence pooling: masked mean pooling over sequence length → output dimension = 2 * lstm_hidden
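A minimal PyTorch sketch of the text branch, including the masked mean pooling step, might look like this. The class name `TextEncoderSketch` is hypothetical, the token IDs are made-up stand-ins for SentencePiece output, and the mask is assumed to be derived from pad_token_id when none is supplied.

```python
import torch
import torch.nn as nn


class TextEncoderSketch(nn.Module):
    """Hypothetical re-creation of the text branch (not the package's class)."""

    def __init__(self, vocab_size=2048, embed_dim=128, lstm_hidden=128,
                 lstm_layers=1, pad_token_id=1):
        super().__init__()
        self.pad_token_id = pad_token_id
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_token_id)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, num_layers=lstm_layers,
                            batch_first=True, bidirectional=True)

    def forward(self, input_ids, attention_mask=None):
        if attention_mask is None:
            # Assumption: every non-pad token is a real token.
            attention_mask = input_ids != self.pad_token_id
        # Note: padded positions still pass through the LSTM here; the real
        # model may handle this differently (e.g. with packed sequences).
        out, _ = self.lstm(self.embed(input_ids))        # (batch, seq, 2 * hidden)
        mask = attention_mask.unsqueeze(-1).to(out.dtype)
        summed = (out * mask).sum(dim=1)                 # ignore padded steps
        counts = mask.sum(dim=1).clamp(min=1.0)          # avoid division by zero
        return summed / counts                           # (batch, 2 * hidden)


text_encoder = TextEncoderSketch().eval()
input_ids = torch.tensor([[5, 6, 7, 1], [8, 9, 1, 1]])  # 1 = pad_token_id
text_emb = text_encoder(input_ids)
print(text_emb.shape)
```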
Fusion & Regression Head
Fusion: Concatenate [audio_emb, text_emb] → combined_dim = audio_conv_channels + 2 * lstm_hidden
Regression head:
- Linear(combined_dim → head_hidden) → ReLU → Dropout(dropout)
- Linear(head_hidden → head_hidden) → ReLU
- Linear(head_hidden → 1) → Sigmoid
Output: Scalar ∈ [0, 1] (predicted reward score)
Objective
- Loss: Mean Squared Error (MSE)
- Goal: Predict similarity between spoken audio and its transcription
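The fusion step, regression head, and MSE objective described above can be sketched as follows. `RewardHeadSketch` is a hypothetical name, the embeddings are random stand-ins, and the target scores are invented for illustration; the dimensions follow the parameter table (128 + 2 * 128 = 384 fused features).

```python
import torch
import torch.nn as nn


class RewardHeadSketch(nn.Module):
    """Hypothetical fusion + regression head (not the package's class)."""

    def __init__(self, audio_dim=128, text_dim=256, head_hidden=256, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + text_dim, head_hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(head_hidden, head_hidden), nn.ReLU(),
            nn.Linear(head_hidden, 1), nn.Sigmoid(),  # squash to [0, 1]
        )

    def forward(self, audio_emb, text_emb):
        fused = torch.cat([audio_emb, text_emb], dim=-1)  # (batch, 384)
        return self.net(fused).squeeze(-1)                # (batch,)


head = RewardHeadSketch().eval()
scores = head(torch.randn(2, 128), torch.randn(2, 256))

# MSE between predicted scores and (made-up) human scores.
targets = torch.tensor([0.9, 0.2])
loss = nn.functional.mse_loss(scores, targets)
print(scores.shape, loss.item())
```

The final sigmoid keeps predictions in the same [0, 1] range as the annotated scores, which is what makes a plain MSE regression loss a natural fit.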
| Parameter | Value |
|---|---|
| audio_conv_layers | 3 |
| audio_conv_channels | 128 |
| kernel_size | 5 |
| stride | 1 |
| padding | 2 |
| embed_dim | 128 |
| vocab_size | 2048 |
| lstm_hidden | 128 |
| lstm_layers | 1 |
| head_hidden | 256 |
| dropout | 0.1 |
| pad_token_id | 1 |
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: cosine
- num_epochs: 10
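To make the cosine schedule concrete, the sketch below evaluates the learning rate at a few steps. It assumes no warmup (none is listed in this card) and a half-cosine decay to zero; the figure of 125 steps per epoch is inferred from the training log, which reports step 500 at epoch 4.0. The function `cosine_lr` is a hypothetical helper, not the trainer's own scheduler.

```python
import math

BASE_LR = 1e-4
STEPS_PER_EPOCH = 125  # inferred: step 500 is logged at epoch 4.0
NUM_EPOCHS = 10
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS


def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Half-cosine decay from base_lr to 0, with optional linear warmup."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{cosine_lr(step):.2e}")
```

Under these assumptions the rate starts at 1e-4, is about half that value midway through training, and reaches zero at the final step.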
Training results
| Training Loss | Epoch | Step | Validation Loss | Mse | R2 | Pearson |
|---|---|---|---|---|---|---|
| 0.1237 | 0.8 | 100 | 0.1100 | 0.1100 | 0.1781 | 0.5916 |
| 0.0675 | 1.6 | 200 | 0.0723 | 0.0723 | 0.4597 | 0.6906 |
| 0.0562 | 2.4 | 300 | 0.0684 | 0.0684 | 0.4890 | 0.7094 |
| 0.0625 | 3.2 | 400 | 0.0650 | 0.0650 | 0.5145 | 0.7175 |
| 0.0563 | 4.0 | 500 | 0.0662 | 0.0662 | 0.5055 | 0.7120 |
| 0.0478 | 4.8 | 600 | 0.0616 | 0.0616 | 0.5396 | 0.7398 |
| 0.0454 | 5.6 | 700 | 0.0634 | 0.0634 | 0.5266 | 0.7264 |
| 0.0429 | 6.4 | 800 | 0.0607 | 0.0607 | 0.5467 | 0.7404 |
| 0.0422 | 7.2 | 900 | 0.0615 | 0.0615 | 0.5405 | 0.7429 |
| 0.0421 | 8.0 | 1000 | 0.0622 | 0.0622 | 0.5353 | 0.7338 |
| 0.0423 | 8.8 | 1100 | 0.0610 | 0.0610 | 0.5446 | 0.7424 |
| 0.0485 | 9.6 | 1200 | 0.0610 | 0.0610 | 0.5445 | 0.7416 |
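For readers who want to reproduce the three reported metrics on their own predictions, a dependency-free sketch is given below. The function name `regression_metrics` and the sample values are illustrative only; the actual evaluation code may compute these quantities differently.

```python
import math


def regression_metrics(preds, targets):
    """Compute MSE, R2, and Pearson correlation for two equal-length lists."""
    n = len(targets)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    ss_res = sum((t - p) ** 2 for p, t in zip(preds, targets))   # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in targets)             # total sum of squares
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    return {
        "mse": ss_res / n,
        "r2": 1.0 - ss_res / ss_tot,
        "pearson": cov / math.sqrt(var_p * ss_tot),
    }


# Made-up predictions and human scores, for illustration only.
metrics = regression_metrics([0.8, 0.3, 0.6, 0.1], [0.9, 0.2, 0.5, 0.0])
print(metrics)
```

Since R2 = 1 - MSE / Var(targets), the logged rows appear consistent with an evaluation-set score variance of roughly 0.134 (for example, 0.0610 / (1 - 0.5446)).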
Framework versions
- Transformers 4.53.3
- Pytorch 2.9.0+cu128
- Datasets 3.3.2
- Tokenizers 0.21.4
Example Usage
First, install our package:
pip install git+https://github.com/diarray-hub/bambara-asr.git@rlnf-v2-gpu
import torch
from RLNF.Rewards.reward_model import RewardModel
from RLNF.Rewards.reward_processor import RewardModelProcessor
from RLNF.Rewards.reward_feature_extraction import RewardFeatureExtractor
from transformers import T5Tokenizer
from nemo.collections.asr.models import EncDecCTCModel
audios = ["1.wav", "2.wav"]
texts = ["kelen", "fila."]
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the tokenizer, the ASR model used for feature extraction, and the reward model
tokenizer: T5Tokenizer = T5Tokenizer.from_pretrained("Panga-Azazia/reward-model")
asr_model: EncDecCTCModel = EncDecCTCModel.from_pretrained("RobotsMali/stt-bm-quartznet15x5-V0")
feature_extractor: RewardFeatureExtractor = RewardFeatureExtractor(asr_model)
processor: RewardModelProcessor = RewardModelProcessor(feature_extractor, tokenizer)
model: RewardModel = RewardModel.from_pretrained("Panga-Azazia/reward-model")
model.eval()
model.to(device)
# Build a batch and move its tensors to the target device
out = processor(audios=audios, texts=texts)
out = {k: v.to(device) if torch.is_tensor(v) else v for k, v in out.items()}
# Score each (audio, text) pair without tracking gradients
with torch.no_grad():
    preds = model(**out).logits
# Scores are in [0, 1]; they are printed here as percentages
for i, (t, val) in enumerate(zip(texts, preds)):
    print(f"Audio : {audios[i]:<10} | Text: {t:<10} | Score: {val.item() * 100:.4f}")