FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
Abstract
FuseCodec unifies acoustic, semantic, and contextual representations in speech tokenization through cross-modal alignment and global supervision, achieving state-of-the-art performance in transcription and synthesis.
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
Community
FuseCodec proposes a speech tokenization framework that fuses semantic and contextual information into neural codecs using three novel strategies (latent fusion, global supervision, and temporal alignment) to enable state-of-the-art discrete speech representations for tasks like zero-shot TTS.
➡️ **Key Highlights of our Multimodal Tokenization Framework:**
🧠 **Latent Representation Fusion** (FuseCodec-Fusion):
FuseCodec fuses semantic representations from self-supervised speech models and contextual representations from pretrained language models directly into the encoder's latent space, using cross-modal multi-head attention followed by additive fusion with stochastic dropout. This enables robust and unified token learning without modifying the core codec architecture.
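A minimal PyTorch sketch of how such a fusion block could look (module names, dimensions, and the reading of "stochastic dropout" as standard dropout on the attended residuals are our assumptions, not the authors' exact module):

```python
import torch.nn as nn


class LatentFusion(nn.Module):
    """Sketch of latent representation fusion; names and defaults are assumptions."""

    def __init__(self, dim: int, num_heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        # Acoustic latents attend to each auxiliary modality separately.
        self.sem_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # "Stochastic dropout" read here as dropout applied to the fused residuals.
        self.drop = nn.Dropout(p_drop)

    def forward(self, acoustic, semantic, contextual):
        # acoustic:   (B, T, D) codec encoder latents
        # semantic:   (B, S, D) self-supervised speech features, projected to D
        # contextual: (B, L, D) pretrained LM features, projected to D
        sem_out, _ = self.sem_attn(query=acoustic, key=semantic, value=semantic)
        ctx_out, _ = self.ctx_attn(query=acoustic, key=contextual, value=contextual)
        # Additive fusion: inject both modalities back into the latent space.
        return acoustic + self.drop(sem_out) + self.drop(ctx_out)
```

The fused latents then pass to the RVQ quantizer as usual, which is how the fusion stays outside the core codec architecture.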
💡 **Global Semantic-Contextual Supervision** (FuseCodec-Distill):
FuseCodec supervises the quantized RVQ tokens by aligning them with globally pooled representations from both modalities, using a cosine-based, timestep-wise distillation loss. This enhances temporal coherence, linguistic grounding, and perceptual quality across the quantized space, outperforming existing tokenizers in WER, PESQ, and UTMOS.
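A rough Python sketch of that loss, assuming mean pooling, equal weighting of the two modalities, and matching feature dimensions (the paper's exact pooling, weighting, and choice of quantizer layer may differ):

```python
import torch.nn.functional as F


def global_supervision_loss(quantized, semantic, contextual):
    """Sketch of globally pooled, timestep-wise cosine distillation.

    quantized:  (B, T, D) quantized RVQ representations
    semantic:   (B, S, D) self-supervised speech features, projected to D
    contextual: (B, L, D) pretrained LM features, projected to D
    """
    # Pool each modality over time into one global vector, then broadcast it
    # across all T timesteps of the quantized sequence.
    sem_target = semantic.mean(dim=1, keepdim=True).expand_as(quantized)
    ctx_target = contextual.mean(dim=1, keepdim=True).expand_as(quantized)

    # Timestep-wise cosine distillation: every quantized frame is pushed
    # toward the broadcast global target.
    sem_loss = 1.0 - F.cosine_similarity(quantized, sem_target, dim=-1).mean()
    ctx_loss = 1.0 - F.cosine_similarity(quantized, ctx_target, dim=-1).mean()
    return sem_loss + ctx_loss
```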
🧭 **Temporally Aligned Contextual Supervision** (FuseCodec-ContextAlign):
By dynamically aligning contextual embeddings to speech tokens through a content-similarity-based windowed matching algorithm, FuseCodec enforces fine-grained cross-modal alignment via cosine supervision, improving intelligibility and interpretability while preserving local linguistic structure without degrading audio quality.
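One way such a windowed matcher could look in Python (the window size, proportional speech-to-text mapping, and similarity scoring below are our assumptions; the paper's algorithm may differ):

```python
import torch
import torch.nn.functional as F


def aligned_context_loss(quantized, contextual, window: int = 2):
    """Sketch of temporally aligned contextual supervision via windowed matching.

    quantized:  (B, T, D) quantized speech representations
    contextual: (B, L, D) contextual (text LM) embeddings, projected to D
    """
    T = quantized.size(1)
    L = contextual.size(1)
    q = F.normalize(quantized, dim=-1)
    c = F.normalize(contextual, dim=-1)

    losses = []
    for t in range(T):
        # Map the speech timestep to a proportional position in the text sequence,
        # then search a small local window around it for the most similar embedding.
        center = int(round(t * (L - 1) / max(T - 1, 1)))
        lo, hi = max(0, center - window), min(L, center + window + 1)
        sims = torch.einsum("bd,bwd->bw", q[:, t], c[:, lo:hi])  # cosine similarities
        best = sims.max(dim=-1).values                           # best local match
        losses.append(1.0 - best)                                # cosine supervision on the match
    return torch.stack(losses, dim=1).mean()
```

The local window keeps each speech token supervised by a nearby context token, which is what preserves local linguistic structure in the alignment.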
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding (2025)
- SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec (2025)
- DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models (2025)
- HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling (2025)
- Entropy-based Coarse and Compressed Semantic Speech Representation Learning (2025)
- TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling (2025)
- Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech (2025)