arxiv:2509.11425

FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

Published on Sep 14 · Submitted by Aman Chadha on Sep 16
Abstract

AI-generated summary: FuseCodec unifies acoustic, semantic, and contextual representations in speech tokenization through cross-modal alignment and global supervision, achieving state-of-the-art performance in transcription and synthesis.

Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.

Community

Paper author · Paper submitter

FuseCodec proposes a speech tokenization framework that fuses semantic and contextual information into neural codecs using three novel strategies (latent fusion, global supervision, and temporal alignment) to enable state-of-the-art discrete speech representations for tasks like zero-shot TTS.

➡️ Key Highlights of our Multimodal Tokenization Framework:

🧠 Latent Representation Fusion (FuseCodec-Fusion):
FuseCodec fuses semantic representations from self-supervised speech models and contextual representations from pretrained language models directly into the encoderโ€™s latent space using cross-modal multi-head attention followed by additive fusion with stochastic dropout, enabling robust and unified token learning without modifying the core codec architecture.
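
A minimal PyTorch sketch of this fusion step, assuming a single fusion module applied to the encoder output: the latents attend to the semantic and contextual streams via cross-modal multi-head attention, and the attended features are added back with dropout. Dimensions, module names, and the dropout placement are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    """Cross-modal attention + additive fusion over encoder latents (sketch)."""

    def __init__(self, d_model=512, n_heads=8, p_drop=0.1):
        super().__init__()
        # Encoder latents attend separately to each auxiliary stream.
        self.attn_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_ctx = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stochastic dropout applied to the fused residuals.
        self.drop = nn.Dropout(p_drop)

    def forward(self, z, sem, ctx):
        # z:   (B, T, d)  encoder latents
        # sem: (B, Ts, d) semantic features (e.g., from a self-supervised speech model)
        # ctx: (B, Tc, d) contextual features (e.g., from a pretrained language model)
        sem_aligned, _ = self.attn_sem(query=z, key=sem, value=sem)
        ctx_aligned, _ = self.attn_ctx(query=z, key=ctx, value=ctx)
        # Additive fusion: the fused latent goes straight into the existing
        # quantizer, so the core codec architecture stays untouched.
        return z + self.drop(sem_aligned) + self.drop(ctx_aligned)
```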

📡 Global Semantic-Contextual Supervision (FuseCodec-Distill):
FuseCodec supervises the quantized RVQ tokens by aligning them with globally pooled representations from both modalities using a cosine-based, timestep-wise distillation loss, which enhances temporal coherence, linguistic grounding, and perceptual quality across the quantized space, outperforming existing tokenizers on WER, PESQ, and UTMOS.
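
A minimal sketch of this kind of globally pooled, timestep-wise cosine supervision, assuming mean pooling over time, detached (teacher-style) targets, and equal weighting of the two modalities; none of these specifics are confirmed by the text above.

```python
import torch
import torch.nn.functional as F

def global_supervision_loss(q, sem, ctx):
    # q:   (B, T, d)  quantized RVQ representations to be supervised
    # sem: (B, Ts, d) semantic features; ctx: (B, Tc, d) contextual features
    # Globally pool each auxiliary stream over time and broadcast the pooled
    # vector to every timestep of the quantized sequence. Targets are detached
    # so they act as fixed teachers.
    sem_global = sem.mean(dim=1, keepdim=True).detach().expand_as(q)  # (B, T, d)
    ctx_global = ctx.mean(dim=1, keepdim=True).detach().expand_as(q)  # (B, T, d)
    # Timestep-wise cosine distillation: pull each quantized frame toward the
    # global semantic and contextual targets.
    loss_sem = (1 - F.cosine_similarity(q, sem_global, dim=-1)).mean()
    loss_ctx = (1 - F.cosine_similarity(q, ctx_global, dim=-1)).mean()
    return loss_sem + loss_ctx

# Toy usage with random tensors: 2 utterances, 100 speech frames, 24 text tokens.
loss = global_supervision_loss(
    torch.randn(2, 100, 512), torch.randn(2, 100, 512), torch.randn(2, 24, 512)
)
```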

🧭 Temporally Aligned Contextual Supervision (FuseCodec-ContextAlign):
By dynamically aligning contextual embeddings to speech tokens through a content-similarity-based windowed matching algorithm, FuseCodec enforces fine-grained cross-modal alignment via cosine supervision, improving intelligibility and interpretability while preserving local linguistic structure without degrading audio quality.
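
One possible shape of such windowed, similarity-based matching with cosine supervision is sketched below. The proportional (monotonic) anchor, fixed window size, and hard argmax matching are illustrative assumptions rather than the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def windowed_alignment_loss(q, ctx, window=4):
    # q:   (B, T, d) quantized speech representations
    # ctx: (B, L, d) contextual token embeddings (e.g., from a language model)
    B, T, d = q.shape
    L = ctx.shape[1]
    # Cosine similarity between every speech frame and every contextual token.
    sim = torch.bmm(F.normalize(q, dim=-1),
                    F.normalize(ctx, dim=-1).transpose(1, 2))            # (B, T, L)
    # Restrict each speech frame to a local window around its proportional
    # position in the contextual sequence (a simple monotonic prior).
    t_idx = torch.arange(T, device=q.device)
    anchor = (t_idx.float() * (L - 1) / max(T - 1, 1)).round().long()    # (T,)
    offsets = torch.arange(L, device=q.device).view(1, 1, L)
    outside = (offsets - anchor.view(1, T, 1)).abs() > window            # (1, T, L)
    sim = sim.masked_fill(outside, float("-inf"))
    # Pick the best-matching contextual token per frame, then supervise the
    # speech frame with a cosine loss against that (detached) embedding.
    best = sim.argmax(dim=-1)                                            # (B, T)
    matched = torch.gather(ctx, 1, best.unsqueeze(-1).expand(-1, -1, d)).detach()
    return (1 - F.cosine_similarity(q, matched, dim=-1)).mean()
```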
