reward-model

This model is a Reward Model trained on the RobotsMali transcription scorer dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0609
  • R2: 0.5447
  • Pearson: 0.7406

Model description

This model is a Reward Model trained on the RobotsMali transcription scorer dataset, where the scores were assigned by human annotators. It predicts a continuous score between 0 and 1 for a pair (audio, text), representing how well the text matches the spoken audio.

The model can be integrated as a Reward Model within RLHF pipelines to evaluate or fine-tune ASR models based on human preference scores.
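
The predicted score can serve directly as the scalar reward in such a pipeline. As a rough illustration only (this card does not specify the exact update rule used in the project's RLNF code), a REINFORCE-style objective weights the log-probability of a sampled transcription by its reward; the numbers below are made-up stand-ins:

import torch

# Toy illustration: reward-weighted policy-gradient loss for one sampled hypothesis.
# log_prob would come from the ASR policy, reward from this model; both values are made up.
log_prob = torch.tensor(-4.2, requires_grad=True)
reward = torch.tensor(0.87)

loss = -(reward * log_prob)   # higher-reward hypotheses are reinforced more strongly
loss.backward()
print(log_prob.grad)          # gradient on the policy's log-probability, scaled by the reward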

Intended uses & limitations

Intended uses

  • Evaluate the quality of an ASR transcription against audio, producing a continuous score in [0,1].
  • Integrate as a Reward Model in RLHF (Reinforcement Learning from Human Feedback) pipelines for fine-tuning ASR models.
  • Automatically compare transcriptions generated by different ASR systems or models.
  • Serve as a reference-free proxy metric for ASR, allowing approximate quality evaluation without requiring reference transcriptions.

Limitations

  • Sensitive to accents, background noise, or pronunciation variations not represented in the RobotsMali dataset.
  • Scores follow the scoring rules our team defined for the dataset rather than purely subjective judgment, so they reflect those specific criteria.

Training Procedure

Audio Encoder

Input: Raw waveform (16 kHz)
Feature extraction: Mel-spectrogram using the processor of RobotsMali's STT-BM-QuartzNet15x5-V0 model

Architecture:

  • 1D Convolutional layers: audio_conv_layers × (Conv1d → BatchNorm1d → ReLU)
  • Channels: audio_conv_channels (input channels = 64, kernel size = kernel_size, stride = stride, padding = padding)
  • Adaptive Average Pooling over time → output dimension = audio_conv_channels
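
A minimal PyTorch sketch of the audio branch described above, assuming mel features of shape (batch, 64, time) from the QuartzNet processor; the class and argument names are illustrative, not the package's actual API:

import torch
import torch.nn as nn

class AudioEncoderSketch(nn.Module):
    # audio_conv_layers x (Conv1d -> BatchNorm1d -> ReLU), then adaptive average pooling over time
    def __init__(self, in_channels=64, conv_channels=128, num_layers=3,
                 kernel_size=5, stride=1, padding=2):
        super().__init__()
        blocks, channels = [], in_channels
        for _ in range(num_layers):
            blocks += [nn.Conv1d(channels, conv_channels, kernel_size, stride=stride, padding=padding),
                       nn.BatchNorm1d(conv_channels),
                       nn.ReLU()]
            channels = conv_channels
        self.conv = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, mel):                  # mel: (batch, 64, time)
        x = self.conv(mel)                   # (batch, conv_channels, time)
        return self.pool(x).squeeze(-1)      # (batch, conv_channels)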

Text Encoder

Input: Tokenized transcription (IDs from SentencePiece tokenizer)

Architecture:

  • Embedding layer: embed_dim (vocab_size = vocab_size, padding_idx = pad_token_id)
  • Bidirectional LSTM: hidden size = lstm_hidden, layers = lstm_layers
  • Sequence pooling: masked mean pooling over sequence length → output dimension = 2 * lstm_hidden
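
A corresponding sketch of the text branch, using the hyperparameter values listed further below; again the class name is illustrative:

import torch
import torch.nn as nn

class TextEncoderSketch(nn.Module):
    # Embedding -> bidirectional LSTM -> masked mean pooling over the sequence
    def __init__(self, vocab_size=2048, embed_dim=128, lstm_hidden=128,
                 lstm_layers=1, pad_token_id=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_token_id)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, num_layers=lstm_layers,
                            batch_first=True, bidirectional=True)

    def forward(self, input_ids, attention_mask):    # both (batch, seq_len)
        x = self.embedding(input_ids)                 # (batch, seq_len, embed_dim)
        x, _ = self.lstm(x)                           # (batch, seq_len, 2 * lstm_hidden)
        mask = attention_mask.unsqueeze(-1).float()
        # Masked mean: average only over non-padding positions
        return (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)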

Fusion & Regression Head

Fusion: Concatenate [audio_emb, text_emb] → combined_dim = audio_conv_channels + 2 * lstm_hidden

Regression head:

  • Linear(combined_dim → head_hidden) → ReLU → Dropout(dropout)
  • Linear(head_hidden → head_hidden) → ReLU
  • Linear(head_hidden → 1) → Sigmoid

Output: Scalar ∈ [0, 1] (predicted reward score)
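
And a sketch of the fusion and regression head, which maps the concatenated embeddings to a single score; the dimensions follow the hyperparameter table below:

import torch
import torch.nn as nn

class FusionHeadSketch(nn.Module):
    # Concatenate [audio_emb, text_emb], then regress a score in [0, 1]
    def __init__(self, audio_dim=128, text_dim=256, head_hidden=256, dropout=0.1):
        super().__init__()
        combined_dim = audio_dim + text_dim              # 128 + 2 * 128 = 384
        self.head = nn.Sequential(
            nn.Linear(combined_dim, head_hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(head_hidden, head_hidden), nn.ReLU(),
            nn.Linear(head_hidden, 1), nn.Sigmoid(),
        )

    def forward(self, audio_emb, text_emb):
        fused = torch.cat([audio_emb, text_emb], dim=-1)  # (batch, combined_dim)
        return self.head(fused).squeeze(-1)               # (batch,) scores in [0, 1]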


Objective

  • Loss: Mean Squared Error (MSE)
  • Goal: Predict similarity between spoken audio and its transcription
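
In other words, the predicted score is regressed directly against the human-assigned score with MSE; a toy example with made-up numbers:

import torch
import torch.nn.functional as F

# Made-up batch of model predictions and human-assigned targets, all in [0, 1]
predicted = torch.tensor([0.82, 0.35, 0.91], requires_grad=True)
targets = torch.tensor([0.90, 0.40, 0.85])

loss = F.mse_loss(predicted, targets)
loss.backward()
print(f"MSE loss: {loss.item():.4f}")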

Architecture hyperparameters

Parameter            Value
audio_conv_layers    3
audio_conv_channels  128
kernel_size          5
stride               1
padding              2
embed_dim            128
vocab_size           2048
lstm_hidden          128
lstm_layers          1
head_hidden          256
dropout              0.1
pad_token_id         1

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • num_epochs: 10
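
A sketch of the corresponding optimizer and scheduler setup; the dummy module, the zero warmup steps, and the total step count (roughly 1250, inferred from the results table below) are assumptions, not values stated on this card:

import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 1)   # dummy module standing in for the reward model
num_training_steps = 1250       # ~125 steps/epoch x 10 epochs, inferred from the results table

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=num_training_steps)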

Training results

Training Loss  Epoch  Step  Validation Loss  MSE     R2      Pearson
0.1237         0.8    100   0.1100           0.1100  0.1781  0.5916
0.0675         1.6    200   0.0723           0.0723  0.4597  0.6906
0.0562         2.4    300   0.0684           0.0684  0.4890  0.7094
0.0625         3.2    400   0.0650           0.0650  0.5145  0.7175
0.0563         4.0    500   0.0662           0.0662  0.5055  0.7120
0.0478         4.8    600   0.0616           0.0616  0.5396  0.7398
0.0454         5.6    700   0.0634           0.0634  0.5266  0.7264
0.0429         6.4    800   0.0607           0.0607  0.5467  0.7404
0.0422         7.2    900   0.0615           0.0615  0.5405  0.7429
0.0421         8.0    1000  0.0622           0.0622  0.5353  0.7338
0.0423         8.8    1100  0.0610           0.0610  0.5446  0.7424
0.0485         9.6    1200  0.0610           0.0610  0.5445  0.7416

Framework versions

  • Transformers 4.53.3
  • PyTorch 2.9.0+cu128
  • Datasets 3.3.2
  • Tokenizers 0.21.4

Example Usage

First, install our package:

pip install git+https://github.com/diarray-hub/bambara-asr.git@rlnf-v2-gpu

Then load the model, processor, and feature extractor, and score (audio, text) pairs:
import torch
from RLNF.Rewards.reward_model import RewardModel
from RLNF.Rewards.reward_processor import RewardModelProcessor
from RLNF.Rewards.reward_feature_extraction import RewardFeatureExtractor
from transformers import T5Tokenizer
from nemo.collections.asr.models import EncDecCTCModel

audios = ["1.wav", "2.wav"]
texts = ["kelen", "fila."]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tokenizer for the text branch; the ASR model's processor provides the audio features
tokenizer: T5Tokenizer = T5Tokenizer.from_pretrained("Panga-Azazia/reward-model")
asr_model: EncDecCTCModel = EncDecCTCModel.from_pretrained("RobotsMali/stt-bm-quartznet15x5-V0")
feature_extractor: RewardFeatureExtractor = RewardFeatureExtractor(asr_model)

processor: RewardModelProcessor = RewardModelProcessor(feature_extractor, tokenizer)

model: RewardModel = RewardModel.from_pretrained("Panga-Azazia/reward-model")
model.eval()
model.to(device)

# Build model inputs and move all tensors to the target device
out = processor(audios=audios, texts=texts)
out = {k: v.to(device) if torch.is_tensor(v) else v for k, v in out.items()}

with torch.no_grad():
    preds = model(**out).logits

for i, (t, val) in enumerate(zip(texts, preds)):
    print(f"Audio: {audios[i]:<10} | Text: {t:<10} | Score: {val.item() * 100:.4f}")