Phoneme-based Speech Recognition Experiments

This section covers experiments with phoneme-level approaches for normal and handicapped speech recognition using Whisper encoder representations.


4.4 Phoneme Classification Head

Method

We attach simple classification heads to frozen Whisper encoder outputs for direct phoneme prediction, testing architectures ranging from plain linear layers to BiLSTM networks. Phoneme targets are generated with the Montreal Forced Aligner (MFA).
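
For reference, here is a minimal sketch of the best-performing head (3 BiLSTM + 4 FC layers) on a frozen encoder. The checkpoint name, hidden sizes, and the CTC-style output layer are assumptions; the experiments do not fix these details.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class PhonemeClassificationHead(nn.Module):
    def __init__(self, num_phonemes: int, d_model: int = 1024, hidden: int = 512):
        super().__init__()
        # Frozen Whisper encoder: parameters excluded from optimization.
        # "openai/whisper-medium" is an assumed checkpoint, not confirmed by the report.
        self.encoder = WhisperModel.from_pretrained("openai/whisper-medium").encoder
        self.encoder.requires_grad_(False)
        # 3 BiLSTM layers over the encoder's hidden states.
        self.bilstm = nn.LSTM(d_model, hidden, num_layers=3,
                              batch_first=True, bidirectional=True)
        # 4 FC layers projecting to the phoneme inventory (+1 for a CTC blank).
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_phonemes + 1),
        )

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # the frozen-encoder principle
            h = self.encoder(input_features).last_hidden_state
        h, _ = self.bilstm(h)
        return self.fc(h).log_softmax(-1)  # (batch, frames, num_phonemes + 1)
```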

Results

| Architecture | Parameters | Test CER (%) | Training State |
|---|---|---|---|
| 4 FC layers (frozen encoder) | 1.2M | 104.11 | Underfitting |
| 3 BiLSTM + 4 FC (frozen) | 21M | 39.77 | Best |
| 3 BiLSTM + 4 FC (trainable) | 656M | 99.93 | Overfitting |

Key Finding: The frozen encoder consistently outperforms the trainable encoder, reaching ~40% CER with the BiLSTM head.


4.5 Phoneme Decoder

Method

We train a phoneme-aware decoder with a custom tokenizer over subword phoneme tokens, initializing from the pretrained decoder weights for stability.
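
A minimal sketch of this setup, assuming a Hugging Face Whisper checkpoint and a hypothetical custom phoneme tokenizer (both the checkpoint name and the tokenizer path are placeholders):

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
model.model.encoder.requires_grad_(False)  # frozen-encoder principle

# Hypothetical custom tokenizer trained on subword phoneme sequences.
phoneme_tokenizer = WhisperTokenizer.from_pretrained("path/to/phoneme-tokenizer")

def training_step(input_features: torch.Tensor, phoneme_text: list[str]) -> torch.Tensor:
    labels = phoneme_tokenizer(phoneme_text, return_tensors="pt",
                               padding=True).input_ids
    labels[labels == phoneme_tokenizer.pad_token_id] = -100  # ignore padding in the loss
    out = model(input_features=input_features, labels=labels)
    return out.loss  # cross-entropy over phoneme subword tokens
```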

Results

| Version | Configuration | Test CER (%) | Notes |
|---|---|---|---|
| v10 | Baseline (frozen encoder) | 11.66 | Strong baseline |
| v13 | Encoder trainable | 78.95 | Catastrophic failure |
| v17 | Complex vowels + regularization | 11.78 | Optimal |

Key Finding: Training the encoder causes severe degradation; the frozen encoder with proper regularization achieves ~12% CER.


4.6 Dual Decoder Network

Method

A novel architecture attaches separate P-GPT (phoneme) and S-GPT (syllable) decoders to the frozen Whisper encoder. We test various training strategies and architectural modifications.
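
A minimal sketch of the forward pass, assuming GPT-2-style decoders with cross-attention (e.g. transformers' GPT2LMHeadModel with add_cross_attention=True) and a λ-weighted combination of the two losses; the λ value is an assumption, and the alignment loss of v4 is omitted for brevity:

```python
import torch
import torch.nn as nn

class DualDecoder(nn.Module):
    def __init__(self, encoder: nn.Module, p_gpt: nn.Module, s_gpt: nn.Module,
                 lam: float = 0.5):
        super().__init__()
        self.encoder = encoder.requires_grad_(False)  # frozen Whisper encoder
        self.p_gpt, self.s_gpt, self.lam = p_gpt, s_gpt, lam

    def forward(self, input_features, phoneme_ids, syllable_ids):
        with torch.no_grad():
            enc = self.encoder(input_features).last_hidden_state
        # Each decoder cross-attends to the shared encoder states.
        p_loss = self.p_gpt(input_ids=phoneme_ids,
                            encoder_hidden_states=enc, labels=phoneme_ids).loss
        s_loss = self.s_gpt(input_ids=syllable_ids,
                            encoder_hidden_states=enc, labels=syllable_ids).loss
        return self.lam * p_loss + (1.0 - self.lam) * s_loss
```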

Results

Training Strategy Comparison

| Version | Configuration | Phoneme CER (%) | Syllable CER (%) |
|---|---|---|---|
| v4 | λ-weighted + alignment loss | 11.84 | 13.04 |
| v8 | Ground-truth text training | 3.82 | 4.36 |
| v17 | Optimized tokenization | 2.52 | 2.79 |
| v23 | Top-4 encoder layers trainable | 6.04 | 70.37 |

Architecture Ablations

| Component | Phoneme CER (%) | Impact |
|---|---|---|
| Baseline (32 layers) | 2.52 | Reference |
| Simple embedding replacement | 80.13 | Catastrophic |
| 24 layers + pretrained | 93.20 | Severe degradation |
| 24 layers from scratch | 8.81 | Acceptable |

Multi-stage Training

| Stage | Strategy | Phoneme CER (%) | Syllable CER (%) |
|---|---|---|---|
| End-to-end | Baseline | 2.52 | 2.79 |
| Multi-stage | Sequential training | 1.96 | 2.02 |

Improvement: 22% relative phoneme and 28% relative syllable error reduction.
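
The sequential schedule amounts to toggling which decoder is trainable between stages. A minimal sketch over the DualDecoder from the sketch above; the stage ordering and the run_stage helper are assumptions:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def multi_stage_training(model, run_stage) -> None:
    # `model` is a DualDecoder instance; `run_stage(model)` is a hypothetical
    # helper that runs one optimization pass with the currently trainable params.
    set_trainable(model.p_gpt, True)   # Stage 1: phoneme decoder only
    set_trainable(model.s_gpt, False)
    run_stage(model)
    set_trainable(model.p_gpt, False)  # Stage 2: syllable decoder only
    set_trainable(model.s_gpt, True)
    run_stage(model)
    set_trainable(model.p_gpt, True)   # Stage 3: joint fine-tuning
    set_trainable(model.s_gpt, True)   # (encoder stays frozen throughout)
    run_stage(model)
```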

Time Normalization Algorithm

| Version | Amplification Factor | Normal Spec CER (%) |
|---|---|---|
| Original + Noise Augment | - | 9.86 |
| Original | - | 11.51 |
| 1 | 2 | 9.93 |
| Random | Random | 9.97 |
| 8 (Updated Algo) | 2 | 12.77 |
| 9 (8 + Random Noise) | 2 | 15.33 |
| 10 | Random | 14.07 |

Improvement: CER reduced from 11.51% to 9.93% for the handicapped speaker.
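
One plausible reading of the time-normalization idea, as a hedged sketch: stretch or compress slow, impaired speech toward a typical utterance duration, with the amplification factor capping the rate change. This is an illustration under those assumptions, not the exact updated algorithm scored above.

```python
import librosa
import numpy as np

def time_normalize(y: np.ndarray, sr: int, target_duration: float,
                   amplification_factor: float = 2.0) -> np.ndarray:
    # Rate > 1 speeds the audio up; clip the change to the amplification factor.
    current = len(y) / sr
    rate = current / target_duration
    rate = float(np.clip(rate, 1.0 / amplification_factor, amplification_factor))
    return librosa.effects.time_stretch(y, rate=rate)
```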

Key Insights

🔒 Frozen Encoder Principle

All experiments confirm that the frozen Whisper encoder dramatically outperforms a trainable encoder across all architectures:

  • Classification head: 39.77% vs 99.93% CER
  • Phoneme decoder: 11.78% vs 78.95% CER
  • Dual decoder: 2.52% vs 70.37% CER

πŸ† Best Performance

Dual decoder with multi-stage training achieves:

  • 1.96% phoneme CER
  • 2.02% syllable CER
  • Represents state-of-the-art for phoneme-level handicapped speech recognition

⚠️ Critical Dependencies

  • Full architecture required: Removing encoder layers causes severe degradation
  • Pretrained weights essential: Simple embeddings cannot replace transformer encoder
  • Text-based training: Ground truth text outperforms phoneme conversions

📊 Performance Hierarchy

  1. Dual Decoder (1.96% CER) - Best overall
  2. Phoneme Decoder (11.78% CER) - Good balance
  3. Classification Head (39.77% CER) - Simplest approach

Technical Notes

  • Cross-fold validation shows high variance (39-67% CER), indicating speaker dependency
  • Attention mechanisms cause training instability in classification tasks
  • Residual connections crucial for expert-based architectures
  • Proper tokenization and label consistency are critical for CTC training (see the sketch below)
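
On the last point, a minimal sketch of the CTC bookkeeping that must stay consistent: a dedicated blank index outside the label vocabulary, and length tensors that match the actual sequences. All sizes here are illustrative.

```python
import torch
import torch.nn as nn

NUM_PHONEMES = 70      # assumed phoneme inventory size
BLANK = NUM_PHONEMES   # blank gets its own index, never used in the labels

ctc = nn.CTCLoss(blank=BLANK, zero_infinity=True)

# (T, B, C) log-probabilities over the inventory plus the blank.
log_probs = torch.randn(200, 4, NUM_PHONEMES + 1).log_softmax(-1)
targets = torch.randint(0, NUM_PHONEMES, (4, 50))  # labels exclude BLANK
input_lengths = torch.full((4,), 200)
target_lengths = torch.full((4,), 50)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```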