Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation
Abstract
The paper analyzes SimulST latency metrics, identifies segmentation bias, and introduces YAAL and LongYAAL for more accurate latency evaluation, along with SoftSegmenter for improved alignment quality.
Simultaneous speech-to-text translation (SimulST) systems have to balance translation quality with latency, the delay between speech input and the translated output. While quality evaluation is well established, accurate latency measurement remains a challenge. Existing metrics often produce inconsistent or misleading results, especially in the widely used short-form setting, where speech is artificially presegmented. In this paper, we present the first comprehensive analysis of SimulST latency metrics across language pairs, systems, and both short- and long-form regimes. We uncover a structural bias in current metrics related to segmentation that undermines fair and meaningful comparisons. To address this, we introduce YAAL (Yet Another Average Lagging), a refined latency metric that delivers more accurate evaluations in the short-form regime. We extend YAAL to LongYAAL for unsegmented audio and propose SoftSegmenter, a novel resegmentation tool based on word-level alignment. Our experiments show that YAAL and LongYAAL outperform popular latency metrics, while SoftSegmenter enhances alignment quality in long-form evaluation, together enabling more reliable assessments of SimulST systems.
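The abstract does not give the definition of YAAL or LongYAAL, so the sketch below is background only. It shows the classic Average Lagging computation as it is commonly described for speech input, which is the kind of short-form latency metric the paper analyzes. The function name, parameter names, and millisecond units are illustrative assumptions, not taken from the paper or from any toolkit.

```python
from typing import Sequence


def average_lagging(
    delays_ms: Sequence[float], source_ms: float, reference_len: int
) -> float:
    """Classic Average Lagging for speech input (background only, not YAAL).

    delays_ms[i]  -- amount of source audio (ms) read when target token i
                     was emitted (hypothetical input format).
    source_ms     -- duration of the source segment in ms.
    reference_len -- number of tokens in the reference translation, used to
                     define the ideal emission rate.
    """
    if not delays_ms:
        raise ValueError("delays_ms must contain at least one emitted token")
    if source_ms <= 0 or reference_len <= 0:
        raise ValueError("source_ms and reference_len must be positive")

    # An ideal policy emits tokens evenly over the audio: one every step_ms.
    step_ms = source_ms / reference_len

    total = 0.0
    counted = 0
    for i, delay in enumerate(delays_ms):
        # Lag of token i behind the ideal policy.
        total += delay - i * step_ms
        counted += 1
        if delay >= source_ms:
            # Stop at the first token emitted after the whole segment
            # has been read; later tokens are not counted.
            break
    return total / counted


if __name__ == "__main__":
    # A system that emits one token per second of a 4-second segment.
    print(average_lagging([1000, 2000, 3000, 4000], 4000, 4))
```

In this formulation the loop stops at the first token emitted once the entire segment has been read, so the segment boundary determines which tokens contribute to the average. This may help in picturing why segmentation matters for such metrics, although the specific bias the paper identifies is not spelled out in the abstract.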
Community
This work makes three primary contributions to SimulST evaluation. First, it presents the first comprehensive analysis of existing latency metrics and identifies a systematic, segmentation-related bias in them. Second, it introduces the YAAL and LongYAAL metrics to correct this bias. Third, it provides SoftSegmenter, a novel resegmentation tool that improves alignment for long-form audio. Together, these form a more complete and reliable assessment framework.