arxiv:2509.17349

Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

Published on Sep 22 · Submitted by Sara Papi on Sep 24


Authors: Peter Polák, Sara Papi, Luisa Bentivogli, Ondřej Bojar

AI-generated summary

The paper analyzes SimulST latency metrics, identifies segmentation bias, and introduces YAAL and LongYAAL for more accurate latency evaluation, along with SoftSegmenter for improved alignment quality.

Abstract

Simultaneous speech-to-text translation (SimulST) systems have to balance translation quality with latency, the delay between speech input and the translated output. While quality evaluation is well established, accurate latency measurement remains a challenge. Existing metrics often produce inconsistent or misleading results, especially in the widely used short-form setting, where speech is artificially presegmented. In this paper, we present the first comprehensive analysis of SimulST latency metrics across language pairs, systems, and both short- and long-form regimes. We uncover a structural bias in current metrics related to segmentation that undermines fair and meaningful comparisons. To address this, we introduce YAAL (Yet Another Average Lagging), a refined latency metric that delivers more accurate evaluations in the short-form regime. We extend YAAL to LongYAAL for unsegmented audio and propose SoftSegmenter, a novel resegmentation tool based on word-level alignment. Our experiments show that YAAL and LongYAAL outperform popular latency metrics, while SoftSegmenter enhances alignment quality in long-form evaluation, together enabling more reliable assessments of SimulST systems.
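For readers unfamiliar with the metric family that YAAL refines, the sketch below computes the classic Average Lagging (AL) score in its speech variant. This is not YAAL or LongYAAL, whose definitions are not given on this page. The function name `average_lagging` and its parameters are illustrative assumptions rather than an API from the paper or from any toolkit.

```python
from typing import Optional, Sequence


def average_lagging(
    delays_ms: Sequence[float],
    source_ms: float,
    target_len: Optional[int] = None,
) -> float:
    """Classic Average Lagging (AL) for one speech segment, in milliseconds.

    NOTE: illustrative sketch of the older AL metric, NOT the paper's YAAL.

    delays_ms  -- delays_ms[i] is the amount of source audio (ms) that had
                  been consumed when target token i was emitted
    source_ms  -- total duration of the source segment (ms)
    target_len -- length used for the ideal emission rate; defaults to the
                  number of emitted tokens (an assumption of this sketch)
    """
    if not delays_ms:
        raise ValueError("delays_ms must contain at least one delay")
    if source_ms <= 0:
        raise ValueError("source_ms must be positive")

    n = target_len if target_len is not None else len(delays_ms)
    if n <= 0:
        raise ValueError("target_len must be positive")

    # tau: index (1-based) of the first token emitted after the whole
    # source was consumed; later tokens are excluded from the average.
    tau = len(delays_ms)
    for i, delay in enumerate(delays_ms, start=1):
        if delay >= source_ms:
            tau = i
            break

    # An ideal system emits token i (0-based) after i * source_ms / n ms.
    # AL averages how far each actual delay lags behind that ideal.
    total = sum(
        delay - i * source_ms / n
        for i, delay in enumerate(delays_ms[:tau])
    )
    return total / tau


if __name__ == "__main__":
    # Three tokens emitted at 1 s, 2 s and 3 s of a 3 s segment.
    print(average_lagging([1000, 2000, 3000], 3000))  # 1000.0
    # An offline system that waits for the full segment before emitting.
    print(average_lagging([3000, 3000, 3000], 3000))  # 3000.0
```

Because the score depends on the segment duration `source_ms`, the same system can receive different scores depending on how the audio is presegmented. That dependence on segmentation is the kind of sensitivity the abstract refers to.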


Community

spapi (paper author, paper submitter), 9 days ago

This work makes three primary contributions to SimulST evaluation. First, it presents the first comprehensive analysis of existing latency metrics and identifies a systematic, segmentation-related bias in them. Second, it introduces the YAAL and LongYAAL metrics to correct this bias. Third, it provides SoftSegmenter, a novel resegmentation tool that improves alignment for long-form audio. Together these form a more complete and reliable assessment framework.

