Model Card for Hibiki
Hibiki is a model for streaming speech translation (also known as simultaneous translation). Unlike offline translation, where one waits for the end of the source utterance before starting to translate, Hibiki adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. As the user speaks, Hibiki generates natural speech in the target language, optionally with voice transfer, along with a text translation. Hibiki currently only supports French-to-English translation.
Model Details
This is the model simply referred to as Hibiki in our paper, a 2.7B parameter hierarchical Transformer producing speech and text tokens at a framerate of 12.5Hz, with audio being generated at a 2.2kbps bitrate.
Model Description
Hibiki is a decoder-only model for simultaneous speech translation. Hibiki leverages the multistream architecture of Moshi to model source and target speech jointly. This allows Hibiki to continuously process the input stream while generating the target speech. Hibiki produces text and audio tokens at a constant framerate of 12.5Hz. This allows for a continuous output audio stream, along with timestamped text translation. Since Hibiki relies on simple temperature sampling, it is compatible with batching, unlike models that rely on complex inference policies. Moreover, the fidelity of Hibiki's voice transfer can be controlled by changing the coefficient of the Classifier-Free Guidance: a larger coefficient will increase voice similarity, but excessive coefficients can lead to worse translations.
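As a rough illustration of the two knobs mentioned above, the sketch below shows how classifier-free guidance and temperature sampling are typically combined at a single decoding step. The function and tensor names are hypothetical and do not correspond to Hibiki's actual inference code.

```python
import torch

def sample_next_token(cond_logits: torch.Tensor,
                      uncond_logits: torch.Tensor,
                      cfg_coef: float = 3.0,
                      temperature: float = 0.8) -> torch.Tensor:
    """One generic classifier-free guidance + temperature sampling step.

    cond_logits / uncond_logits: [batch, vocab] logits from the conditioned
    and unconditioned forward passes. A larger cfg_coef pulls samples further
    towards the conditioned distribution (e.g. higher voice similarity), at
    the risk of degrading translation quality when set too high.
    """
    # Extrapolate from the unconditioned logits towards the conditioned ones.
    logits = uncond_logits + cfg_coef * (cond_logits - uncond_logits)
    # Plain temperature sampling: each sequence in the batch is sampled
    # independently, which is what keeps batched inference straightforward.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # [batch, 1]
```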
- Developed by: Kyutai
- Model type: Simultaneous speech-to-speech and speech-to-text translation.
- Language(s) (NLP): French-to-English
- License: CC-BY
Model Sources
Uses
Direct Use
The model can be used for streaming translation from French to English in real-time settings, or for batched simultaneous translation of many input sequences. It is robust to noisy conditions and is trained on sequences of up to 120 seconds.
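The real-time setting boils down to a frame-by-frame loop: one frame of source audio in, one frame of translated audio (plus a text token) out. The sketch below only illustrates that loop; `model.step`, the sample rate, and the frame size are assumptions for illustration, not the actual API (see the main README for real usage).

```python
FRAME_RATE_HZ = 12.5                              # token framerate from the model card
SAMPLE_RATE = 24_000                              # assumed Mimi sample rate
FRAME_SAMPLES = int(SAMPLE_RATE / FRAME_RATE_HZ)  # 1920 samples, i.e. one 80 ms frame

def stream_translate(model, source_frames):
    """Feed one 80 ms source frame at a time and yield the matching target frame.

    `model.step` is a hypothetical stand-in for the actual inference API; it is
    assumed to return one frame of translated audio plus the text token emitted
    at that step.
    """
    for frame in source_frames:  # each frame holds FRAME_SAMPLES audio samples
        target_audio, text_token = model.step(frame)
        yield target_audio, text_token
```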
Downstream Use
Some components of the model can be used independently or repurposed relatively easily. For instance, the Mimi codec is a state-of-the-art audio neural codec that combines semantic and acoustic information into audio tokens running at 12.5Hz and a bitrate of 1.1kbps, which makes it particularly well suited to training speech language models or text-to-speech systems. Regarding the main Hibiki architecture, supporting other pairs of languages would require finetuning.
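As a back-of-the-envelope check, the bitrates quoted in this card follow directly from the framerate and the number of codebooks per frame, assuming 2048-entry codebooks (11 bits per token); the codebook counts below are assumptions chosen to match the stated bitrates, not values taken from this card.

```python
import math

def bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int = 2048) -> float:
    """Bitrate = frames per second * codebooks per frame * bits per codebook entry."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)

print(bitrate_bps(12.5, 8))   # 1100.0 bps, i.e. ~1.1 kbps (Mimi configuration above)
print(bitrate_bps(12.5, 16))  # 2200.0 bps, i.e. ~2.2 kbps (Hibiki's generated audio)
```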
Out-of-Scope Use
The model is not intended to be used to impersonate other people, or for any other malicious use.
How to Get Started with the Model
See the main README file.
Training Details
Training Data
Textual data: The underlying Helium model is trained on a mix of data including Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.
Audio data
- Unsupervised audio dataset: used for pre-training, this is a collection of 7M hours of readily available audio content in English and 450k hours in French, following the preprocessing and recipe of Moshi.
- Synthetic translation dataset: Around 40k hours of parallel French-English data synthesized with contextual alignment (see Section 3.2) with various levels of speaker similarity.
- Translation finetuning: A 900-hour mixture of a resynthesized version of CVSS-T and synthetic long-form utterances.
Training procedure and hyper-parameters
The different stages of the training procedure are detailed in the paper, along with the hyper-parameters.
Compute Infrastructure
The final model was trained on 48 H100 Nvidia GPUs.
Citation
@misc{labiausse2025hibiki,
title={High-Fidelity Simultaneous Speech-To-Speech Translation},
author={Tom Labiausse and Laurent Mazaré and Edouard Grave and Patrick Pérez and Alexandre Défossez and Neil Zeghidour},
year={2025},
eprint={2502.03382},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.03382},
}
Model Card Authors
Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, Neil Zeghidour