Phoneme-based Speech Recognition Experiments
This section covers experiments with phoneme-level approaches for common and handicapped speech recognition using Whisper encoder representations.
4.4 Phoneme Classification Head
Method
Added simple classification heads to frozen Whisper encoder outputs for direct phoneme prediction. Tested various architectures from linear layers to BiLSTM networks. Used Montreal Forced Aligner (MFA) for phoneme target generation.
Results
Architecture | Parameters | Test CER (%) | Training State |
---|---|---|---|
4 FC layers (frozen encoder) | 1.2M | 104.11 | Underfitting |
3 BiLSTM + 4 FC (frozen) | 21M | 39.77 | Best |
3 BiLSTM + 4 FC (trainable) | 656M | 99.93 | Overfitting |
Key Finding: Frozen encoder consistently outperforms trainable encoder, achieving ~40% CER with BiLSTM architecture.
4.5 Phoneme Decoder
Method
Trained phoneme-aware decoder with custom tokenizer on subword phoneme tokens. Maintained pretrained decoder weights for stability.
Results
Version | Configuration | Test CER (%) | Notes |
---|---|---|---|
v10 | Baseline (frozen encoder) | 11.66 | Strong baseline |
v13 | Encoder trainable | 78.95 | Catastrophic failure |
v17 | Complex vowels + regularization | 11.78 | Optimal |
Key Finding: Encoder training causes severe degradation; frozen encoder with proper regularization achieves ~12% CER.
4.6 Dual Decoder Network
Method
Novel architecture with separate P-GPT (phoneme) and S-GPT (syllable) decoders attached to frozen Whisper encoder. Tested various training strategies and architectural modifications.
Results
Training Strategy Comparison
Version | Configuration | Phoneme CER (%) | Syllable CER (%) |
---|---|---|---|
v4 | Ξ»-weighted + alignment loss | 11.84 | 13.04 |
v8 | Ground truth text training | 3.82 | 4.36 |
v17 | Optimized tokenization | 2.52 | 2.79 |
v23 | Top-4 encoder trainable | 6.04 | 70.37 |
Architecture Ablations
Component | Phoneme CER (%) | Impact |
---|---|---|
Baseline (32 layers) | 2.52 | Reference |
Simple embedding replacement | 80.13 | Catastrophic |
24 layers + pretrained | 93.20 | Severe degradation |
24 layers from scratch | 8.81 | Acceptable |
Multi-stage Training
Stage | Strategy | Phoneme CER (%) | Syllable CER (%) |
---|---|---|---|
End-to-end | Baseline | 2.52 | 2.79 |
Multi-stage | Sequential training | 1.96 | 2.02 |
Improvement: 22% phoneme, 28% syllable error reduction
Time Normalization Algorithm
Version | Amplification Factor | Normal Spec CER (%) |
---|---|---|
Original + Noise Augment | - | 9.86 |
Original | - | 11.51 |
1 | 2 | 9.93 |
Random | Random | 9.97 |
8 (Updated Algo) | 2 | 12.77 |
9 (8 + Random Noise) | 2 | 15.33 |
10 | Random | 14.07 |
Improvement: 9.93% CER from original 11.51 for handicapped speaker
Key Insights
π Frozen Encoder Principle
All experiments confirm that frozen Whisper encoder dramatically outperforms trainable encoder across all architectures:
- Classification head: 39.77% vs 99.93% CER
- Phoneme decoder: 11.78% vs 78.95% CER
- Dual decoder: 2.52% vs 70.37% CER
π Best Performance
Dual decoder with multi-stage training achieves:
- 1.96% phoneme CER
- 2.02% syllable CER
- Represents state-of-the-art for phoneme-level handicapped speech recognition
β οΈ Critical Dependencies
- Full architecture required: Removing encoder layers causes severe degradation
- Pretrained weights essential: Simple embeddings cannot replace transformer encoder
- Text-based training: Ground truth text outperforms phoneme conversions
π Performance Hierarchy
- Dual Decoder (1.96% CER) - Best overall
- Phoneme Decoder (11.78% CER) - Good balance
- Classification Head (39.77% CER) - Simplest approach
Technical Notes
- Cross-fold validation shows high variance (39-67% CER), indicating speaker dependency
- Attention mechanisms cause training instability in classification tasks
- Residual connections crucial for expert-based architectures
- Proper tokenization and label consistency are critical for CTC training
- Downloads last month
- 9
Model tree for neoALI/VC-SyllableWhisper-zeroth-vm1-stage4
Base model
openai/whisper-large-v3