Phoneme-based Speech Recognition Experiments

This section covers experiments with phoneme-level approaches for normal and handicapped speech recognition using Whisper encoder representations.


4.4 Phoneme Classification Head

Method

We attach simple classification heads to frozen Whisper encoder outputs for direct phoneme prediction, testing architectures ranging from plain linear layers to BiLSTM networks. Phoneme targets are generated with the Montreal Forced Aligner (MFA).
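
For reference, here is a minimal sketch of the best-performing head (3 BiLSTM + 4 FC layers) on a frozen encoder. The checkpoint name, hidden sizes, and the CTC-style output layer are assumptions; the experiments do not fix these details.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class PhonemeClassificationHead(nn.Module):
    def __init__(self, num_phonemes: int, d_model: int = 1024, hidden: int = 512):
        super().__init__()
        # Frozen Whisper encoder: parameters excluded from optimization.
        # "openai/whisper-medium" is an assumed checkpoint, not confirmed by the report.
        self.encoder = WhisperModel.from_pretrained("openai/whisper-medium").encoder
        self.encoder.requires_grad_(False)
        # 3 BiLSTM layers over the encoder's hidden states.
        self.bilstm = nn.LSTM(d_model, hidden, num_layers=3,
                              batch_first=True, bidirectional=True)
        # 4 FC layers projecting to the phoneme inventory (+1 for a CTC blank).
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_phonemes + 1),
        )

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # the frozen-encoder principle
            h = self.encoder(input_features).last_hidden_state
        h, _ = self.bilstm(h)
        return self.fc(h).log_softmax(-1)  # (batch, frames, num_phonemes + 1)
```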

Results

| Architecture | Parameters | Test CER (%) | Training State |
|---|---|---|---|
| 4 FC layers (frozen encoder) | 1.2M | 104.11 | Underfitting |
| 3 BiLSTM + 4 FC (frozen) | 21M | 39.77 | Best |
| 3 BiLSTM + 4 FC (trainable) | 656M | 99.93 | Overfitting |

Key Finding: The frozen encoder consistently outperforms the trainable encoder, reaching ~40% CER with the BiLSTM head.


4.5 Phoneme Decoder

Method

We train a phoneme-aware decoder with a custom tokenizer over subword phoneme tokens, initializing from the pretrained decoder weights for stability.
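
A minimal sketch of this setup, assuming a Hugging Face Whisper checkpoint and a hypothetical custom phoneme tokenizer (both the checkpoint name and the tokenizer path are placeholders):

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
model.model.encoder.requires_grad_(False)  # frozen-encoder principle

# Hypothetical custom tokenizer trained on subword phoneme sequences.
phoneme_tokenizer = WhisperTokenizer.from_pretrained("path/to/phoneme-tokenizer")

def training_step(input_features: torch.Tensor, phoneme_text: list[str]) -> torch.Tensor:
    labels = phoneme_tokenizer(phoneme_text, return_tensors="pt",
                               padding=True).input_ids
    labels[labels == phoneme_tokenizer.pad_token_id] = -100  # ignore padding in the loss
    out = model(input_features=input_features, labels=labels)
    return out.loss  # cross-entropy over phoneme subword tokens
```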

Results

| Version | Configuration | Test CER (%) | Notes |
|---|---|---|---|
| v10 | Baseline (frozen encoder) | 11.66 | Strong baseline |
| v13 | Encoder trainable | 78.95 | Catastrophic failure |
| v17 | Complex vowels + regularization | 11.78 | Optimal |

Key Finding: Training the encoder causes severe degradation; the frozen encoder with proper regularization achieves ~12% CER.


4.6 Dual Decoder Network

Method

A novel architecture attaches separate P-GPT (phoneme) and S-GPT (syllable) decoders to the frozen Whisper encoder. We test various training strategies and architectural modifications.
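
A minimal sketch of the forward pass, assuming GPT-2-style decoders with cross-attention (e.g. transformers' GPT2LMHeadModel with add_cross_attention=True) and a λ-weighted combination of the two losses; the λ value is an assumption, and the alignment loss of v4 is omitted for brevity:

```python
import torch
import torch.nn as nn

class DualDecoder(nn.Module):
    def __init__(self, encoder: nn.Module, p_gpt: nn.Module, s_gpt: nn.Module,
                 lam: float = 0.5):
        super().__init__()
        self.encoder = encoder.requires_grad_(False)  # frozen Whisper encoder
        self.p_gpt, self.s_gpt, self.lam = p_gpt, s_gpt, lam

    def forward(self, input_features, phoneme_ids, syllable_ids):
        with torch.no_grad():
            enc = self.encoder(input_features).last_hidden_state
        # Each decoder cross-attends to the shared encoder states.
        p_loss = self.p_gpt(input_ids=phoneme_ids,
                            encoder_hidden_states=enc, labels=phoneme_ids).loss
        s_loss = self.s_gpt(input_ids=syllable_ids,
                            encoder_hidden_states=enc, labels=syllable_ids).loss
        return self.lam * p_loss + (1.0 - self.lam) * s_loss
```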

Results

Training Strategy Comparison

| Version | Configuration | Phoneme CER (%) | Syllable CER (%) |
|---|---|---|---|
| v4 | λ-weighted + alignment loss | 11.84 | 13.04 |
| v8 | Ground-truth text training | 3.82 | 4.36 |
| v17 | Optimized tokenization | 2.52 | 2.79 |
| v23 | Top-4 encoder layers trainable | 6.04 | 70.37 |

Architecture Ablations

| Component | Phoneme CER (%) | Impact |
|---|---|---|
| Baseline (32 layers) | 2.52 | Reference |
| Simple embedding replacement | 80.13 | Catastrophic |
| 24 layers + pretrained | 93.20 | Severe degradation |
| 24 layers from scratch | 8.81 | Acceptable |

Multi-stage Training

| Stage | Strategy | Phoneme CER (%) | Syllable CER (%) |
|---|---|---|---|
| End-to-end | Baseline | 2.52 | 2.79 |
| Multi-stage | Sequential training | 1.96 | 2.02 |

Improvement: 22% relative phoneme and 28% relative syllable error reduction.
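
The sequential schedule amounts to toggling which decoder is trainable between stages. A minimal sketch over the DualDecoder from the sketch above; the stage ordering and the run_stage helper are assumptions:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def multi_stage_training(model, run_stage) -> None:
    # `model` is a DualDecoder instance; `run_stage(model)` is a hypothetical
    # helper that runs one optimization pass with the currently trainable params.
    set_trainable(model.p_gpt, True)   # Stage 1: phoneme decoder only
    set_trainable(model.s_gpt, False)
    run_stage(model)
    set_trainable(model.p_gpt, False)  # Stage 2: syllable decoder only
    set_trainable(model.s_gpt, True)
    run_stage(model)
    set_trainable(model.p_gpt, True)   # Stage 3: joint fine-tuning
    set_trainable(model.s_gpt, True)   # (encoder stays frozen throughout)
    run_stage(model)
```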

Time Normalization Algorithm

| Version | Amplification Factor | Normal Spec CER (%) |
|---|---|---|
| Original + Noise Augment | - | 9.86 |
| Original | - | 11.51 |
| 1 | 2 | 9.93 |
| Random | Random | 9.97 |
| 8 (Updated Algo) | 2 | 12.77 |
| 9 (8 + Random Noise) | 2 | 15.33 |
| 10 | Random | 14.07 |

Improvement: CER reduced from 11.51% to 9.93% for the handicapped speaker.
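
One plausible reading of the time-normalization idea, as a hedged sketch: stretch or compress slow, impaired speech toward a typical utterance duration, with the amplification factor capping the rate change. This is an illustration under those assumptions, not the exact updated algorithm scored above.

```python
import librosa
import numpy as np

def time_normalize(y: np.ndarray, sr: int, target_duration: float,
                   amplification_factor: float = 2.0) -> np.ndarray:
    # Rate > 1 speeds the audio up; clip the change to the amplification factor.
    current = len(y) / sr
    rate = current / target_duration
    rate = float(np.clip(rate, 1.0 / amplification_factor, amplification_factor))
    return librosa.effects.time_stretch(y, rate=rate)
```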

Key Insights

🔒 Frozen Encoder Principle

All experiments confirm that the frozen Whisper encoder dramatically outperforms a trainable encoder across all architectures:

  • Classification head: 39.77% vs 99.93% CER
  • Phoneme decoder: 11.78% vs 78.95% CER
  • Dual decoder: 2.52% vs 70.37% CER

πŸ† Best Performance

Dual decoder with multi-stage training achieves:

  • 1.96% phoneme CER
  • 2.02% syllable CER
  • Represents state-of-the-art for phoneme-level handicapped speech recognition

⚠️ Critical Dependencies

  • Full architecture required: Removing encoder layers causes severe degradation
  • Pretrained weights essential: Simple embeddings cannot replace transformer encoder
  • Text-based training: Ground truth text outperforms phoneme conversions

📊 Performance Hierarchy

  1. Dual Decoder (1.96% CER) - Best overall
  2. Phoneme Decoder (11.78% CER) - Good balance
  3. Classification Head (39.77% CER) - Simplest approach

Technical Notes

  • Cross-fold validation shows high variance (39-67% CER), indicating speaker dependency
  • Attention mechanisms cause training instability in classification tasks
  • Residual connections crucial for expert-based architectures
  • Proper tokenization and label consistency are critical for CTC training (see the sketch below)
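
On the last point, a minimal sketch of the CTC bookkeeping that must stay consistent: a dedicated blank index outside the label vocabulary, and length tensors that match the actual sequences. All sizes here are illustrative.

```python
import torch
import torch.nn as nn

NUM_PHONEMES = 70      # assumed phoneme inventory size
BLANK = NUM_PHONEMES   # blank gets its own index, never used in the labels

ctc = nn.CTCLoss(blank=BLANK, zero_infinity=True)

# (T, B, C) log-probabilities over the inventory plus the blank.
log_probs = torch.randn(200, 4, NUM_PHONEMES + 1).log_softmax(-1)
targets = torch.randint(0, NUM_PHONEMES, (4, 50))  # labels exclude BLANK
input_lengths = torch.full((4,), 200)
target_lengths = torch.full((4,), 50)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```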