arxiv:2509.11425

FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

Published on Sep 14 · Submitted by Aman Chadha on Sep 16
Abstract

AI-generated summary: FuseCodec unifies acoustic, semantic, and contextual representations in speech tokenization through cross-modal alignment and global supervision, achieving state-of-the-art performance in transcription and synthesis.

Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.

Community

Paper author · Paper submitter

FuseCodec proposes a speech tokenization framework that fuses semantic and contextual information into neural codecs using three novel strategies (latent fusion, global supervision, and temporal alignment) to enable state-of-the-art discrete speech representations for tasks like zero-shot TTS.

➡️ Key Highlights of our Multimodal Tokenization Framework:

🧠 Latent Representation Fusion (FuseCodec-Fusion):
FuseCodec fuses semantic representations from self-supervised speech models and contextual representations from pretrained language models directly into the encoderโ€™s latent space using cross-modal multi-head attention followed by additive fusion with stochastic dropout, enabling robust and unified token learning without modifying the core codec architecture.
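
A minimal PyTorch sketch of this fusion step, assuming a single fusion module applied to the encoder output: the latents attend to the semantic and contextual streams via cross-modal multi-head attention, and the attended features are added back with dropout. Dimensions, module names, and the dropout placement are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    """Cross-modal attention + additive fusion over encoder latents (sketch)."""

    def __init__(self, d_model=512, n_heads=8, p_drop=0.1):
        super().__init__()
        # Encoder latents attend separately to each auxiliary stream.
        self.attn_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_ctx = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stochastic dropout applied to the fused residuals.
        self.drop = nn.Dropout(p_drop)

    def forward(self, z, sem, ctx):
        # z:   (B, T, d)  encoder latents
        # sem: (B, Ts, d) semantic features (e.g., from a self-supervised speech model)
        # ctx: (B, Tc, d) contextual features (e.g., from a pretrained language model)
        sem_aligned, _ = self.attn_sem(query=z, key=sem, value=sem)
        ctx_aligned, _ = self.attn_ctx(query=z, key=ctx, value=ctx)
        # Additive fusion: the fused latent goes straight into the existing
        # quantizer, so the core codec architecture stays untouched.
        return z + self.drop(sem_aligned) + self.drop(ctx_aligned)
```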

📡 Global Semantic-Contextual Supervision (FuseCodec-Distill):
FuseCodec supervises the quantized RVQ tokens by aligning them with globally pooled representations from both modalities using a cosine-based, timestep-wise distillation loss, which enhances temporal coherence, linguistic grounding, and perceptual quality across the quantized space, outperforming existing tokenizers on WER, PESQ, and UTMOS.
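
A minimal sketch of this kind of globally pooled, timestep-wise cosine supervision, assuming mean pooling over time, detached (teacher-style) targets, and equal weighting of the two modalities; none of these specifics are confirmed by the text above.

```python
import torch
import torch.nn.functional as F

def global_supervision_loss(q, sem, ctx):
    # q:   (B, T, d)  quantized RVQ representations to be supervised
    # sem: (B, Ts, d) semantic features; ctx: (B, Tc, d) contextual features
    # Globally pool each auxiliary stream over time and broadcast the pooled
    # vector to every timestep of the quantized sequence. Targets are detached
    # so they act as fixed teachers.
    sem_global = sem.mean(dim=1, keepdim=True).detach().expand_as(q)  # (B, T, d)
    ctx_global = ctx.mean(dim=1, keepdim=True).detach().expand_as(q)  # (B, T, d)
    # Timestep-wise cosine distillation: pull each quantized frame toward the
    # global semantic and contextual targets.
    loss_sem = (1 - F.cosine_similarity(q, sem_global, dim=-1)).mean()
    loss_ctx = (1 - F.cosine_similarity(q, ctx_global, dim=-1)).mean()
    return loss_sem + loss_ctx

# Toy usage with random tensors: 2 utterances, 100 speech frames, 24 text tokens.
loss = global_supervision_loss(
    torch.randn(2, 100, 512), torch.randn(2, 100, 512), torch.randn(2, 24, 512)
)
```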

🧭 Temporally Aligned Contextual Supervision (FuseCodec-ContextAlign):
By dynamically aligning contextual embeddings to speech tokens through a content-similarity-based windowed matching algorithm, FuseCodec enforces fine-grained cross-modal alignment via cosine supervision, improving intelligibility and interpretability while preserving local linguistic structure without degrading audio quality.
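
One possible shape of such windowed, similarity-based matching with cosine supervision is sketched below. The proportional (monotonic) anchor, fixed window size, and hard argmax matching are illustrative assumptions rather than the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def windowed_alignment_loss(q, ctx, window=4):
    # q:   (B, T, d) quantized speech representations
    # ctx: (B, L, d) contextual token embeddings (e.g., from a language model)
    B, T, d = q.shape
    L = ctx.shape[1]
    # Cosine similarity between every speech frame and every contextual token.
    sim = torch.bmm(F.normalize(q, dim=-1),
                    F.normalize(ctx, dim=-1).transpose(1, 2))            # (B, T, L)
    # Restrict each speech frame to a local window around its proportional
    # position in the contextual sequence (a simple monotonic prior).
    t_idx = torch.arange(T, device=q.device)
    anchor = (t_idx.float() * (L - 1) / max(T - 1, 1)).round().long()    # (T,)
    offsets = torch.arange(L, device=q.device).view(1, 1, L)
    outside = (offsets - anchor.view(1, T, 1)).abs() > window            # (1, T, L)
    sim = sim.masked_fill(outside, float("-inf"))
    # Pick the best-matching contextual token per frame, then supervise the
    # speech frame with a cosine loss against that (detached) embedding.
    best = sim.argmax(dim=-1)                                            # (B, T)
    matched = torch.gather(ctx, 1, best.unsqueeze(-1).expand(-1, -1, d)).detach()
    return (1 - F.cosine_similarity(q, matched, dim=-1)).mean()
```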
