sxdxfan committed on
Commit e23fc14 · 1 Parent(s): 04f2488

Add model card, acoustic model checkpoint in safetensors format, ONNX model, tokenizer configs and KenLM model

Files changed (9)
  1. README.md +160 -3
  2. added_tokens.json +5 -0
  3. config.json +117 -0
  4. kenlm.bin +3 -0
  5. model.onnx +3 -0
  6. model.safetensors +3 -0
  7. special_tokens_map.json +6 -0
  8. tokenizer_config.json +47 -0
  9. vocab.json +37 -0
README.md CHANGED
@@ -1,3 +1,160 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - ru
+ pipeline_tag: automatic-speech-recognition
+ tags:
+ - conformer
+ - streaming
+ - asr
+ - stt
+ - telephony
+ - russian
+ - speech
+ - t-tech
+ - t-one
+ ---
+
+ # T-one: Streaming ASR for Russian Telephony
+
+ **🚀 T-one is a high-performance streaming ASR pipeline for Russian, specialized for the telephony domain.**
+
+ T-one provides a complete low-latency solution for real-time transcription. It combines a pretrained streaming Conformer-based acoustic model, a custom phrase boundary detector, and a decoder, making it ready to use in production environments. The project ships not only the pretrained model but also a full suite of tools for inference, fine-tuning, and deployment.
+
+ Developed by *T-Software DC*, this project is a practical low-latency, high-throughput ASR solution with modular components.
+
+ For more details, see the [**GitHub Repository**](https://github.com/voicekit-team/T-one).
+ ## Table of Contents
+ 1. [Project Summary](#-project-summary)
+ 2. [Quality benchmarks](#-quality-benchmarks)
+ 3. [Inference examples](#-inference-examples)
+ 4. [Fine-tuning](#-fine-tuning)
+ 5. [Acoustic model](#-acoustic-model)
+ 6. [Training details](#-training-details)
+ 7. [License](#-license)
+
+ ## 📝 Project Summary
+
+ **Key Features:**
+ - **Streaming-first Architecture:** Built for low-latency, real-time applications.
+ - **Ready-to-Use Pipeline:** Includes a pretrained acoustic model, phrase splitter, and a KenLM-based [**CTC**](https://huggingface.co/learn/audio-course/chapter3/ctc) beam search decoder, with examples for offline and streaming speech recognition inference.
+ - **Demo:** Launch a local speech recognition service instantly via Docker and transcribe audio files or real-time microphone input.
+ - **Fine-tuning:** T-one is straightforward to fine-tune on a custom dataset using the 🤗 ecosystem.
+ - **Easy Deployment:** Includes examples for deploying with Triton Inference Server for high-throughput scenarios.
+ - **Fully Open-Source Architecture:** All model and pipeline code is available.
+
+ ## 📊 Quality benchmarks
+
+ **Word Error Rate ([WER](https://huggingface.co/spaces/evaluate-metric/wer))** is used to evaluate the quality of automatic speech recognition systems; it can be interpreted as the percentage of incorrectly recognized words relative to a reference transcript. A lower value indicates higher accuracy. T-one demonstrates state-of-the-art performance, especially on its target domain of telephony, while remaining competitive on general-purpose benchmarks.
+
+ | Category | T-one (70M) | GigaAM-RNNT v2 (243M) | GigaAM-CTC v2 (242M) | Vosk-model-ru 0.54 (65M) | Vosk-model-small-streaming-ru 0.54 (20M) | Whisper large-v3 (1540M) |
+ |:--|:--|:--|:--|:--|:--|:--|
+ | Call-center | **8.63** | 10.22 | 10.57 | 11.28 | 15.53 | 19.39 |
+ | Other telephony | **6.20** | 7.88 | 8.15 | 8.69 | 13.49 | 17.29 |
+ | Named entities | **5.83** | 9.55 | 9.81 | 12.12 | 17.65 | 17.87 |
+ | CommonVoice 19 (test split) | 5.32 | **2.68** | 3.14 | 6.22 | 11.3 | 5.78 |
+ | OpenSTT asr_calls_2_val original | 20.27 | **20.07** | 21.24 | 22.64 | 29.45 | 29.02 |
+ | OpenSTT asr_calls_2_val re-labeled | **7.94** | 11.14 | 12.43 | 13.22 | 21.03 | 20.82 |
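+
+ As an illustration of the metric itself (not of how the numbers above were produced), WER can be computed with the `jiwer` package (assumed here purely for the example):
+
+ ```python
+ import jiwer  # pip install jiwer; an external helper, not part of T-one
+
+ reference = "привет это я"
+ hypothesis = "привет это я я"  # one inserted word
+
+ print(jiwer.wer(reference, hypothesis))  # 0.333... -> 1 error over 3 reference words
+ ```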
+
+ ## 👨‍💻 Inference examples
+
+ ### Offline Inference (for entire audio files)
+ ```python
+ from tone import StreamingCTCPipeline, read_audio, read_example_audio
+
+
+ audio = read_example_audio() # or read_audio("your_audio.flac")
+
+ pipeline = StreamingCTCPipeline.from_hugging_face()
+ print(pipeline.forward_offline(audio)) # run offline recognition
+ ```
+ Output:
+ ```
+ [TextPhrase(text='привет', start_time=1.79, end_time=2.04), TextPhrase(text='это я', start_time=3.72, end_time=4.26), TextPhrase(text='я подумала не хочешь ли ты встретиться спустя все эти годы', start_time=5.88, end_time=10.59)]
+ ```
+
+ ### Streaming Inference (for real-time audio)
+
+ ```python
+ from tone import StreamingCTCPipeline, read_stream_example_audio
+
+
+ pipeline = StreamingCTCPipeline.from_hugging_face()
+
+ state = None # Current state of the ASR pipeline (None - initial)
+ for audio_chunk in read_stream_example_audio(): # Use any source of audio chunks
+     new_phrases, state = pipeline.forward(audio_chunk, state)
+     print(new_phrases)
+
+ # Finalize the pipeline and get the remaining phrases
+ new_phrases, _ = pipeline.finalize(state)
+ print(new_phrases)
+ ```
+ Output:
+ ```
+ TextPhrase(text='привет', start_time=1.79, end_time=2.04)
+ TextPhrase(text='это я', start_time=3.72, end_time=4.26)
+ TextPhrase(text='я подумала не хочешь ли ты встретиться спустя все эти годы', start_time=5.88, end_time=10.59)
+ ```
+
+ ## 🔧 Fine-tuning
+ To fine-tune T-one from a pretrained checkpoint, prepare the training dataset and load the tokenizer and feature extractor from the `t-tech/T-one` 🤗 repo.
+
+ ```python
+ import torch
+
+ from tone.training.model_wrapper import ToneForCTC
+
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = ToneForCTC.from_pretrained("t-tech/T-one").to(device)
+ ```
+
+ Then set up the data collator, evaluation metric, training arguments, and 🤗 Trainer.
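+
+ A minimal sketch of that wiring is shown below. It assumes a data collator that pads audio features and replaces label padding with `-100`, and it uses purely illustrative hyperparameters; `build_trainer` and its arguments are not part of the library API.
+
+ ```python
+ import numpy as np
+ import evaluate
+ from transformers import Trainer, TrainingArguments
+
+
+ def build_trainer(model, tokenizer, data_collator, train_dataset, eval_dataset):
+     """Wire up a 🤗 Trainer for CTC fine-tuning; the arguments come from the previous steps."""
+     wer_metric = evaluate.load("wer")
+
+     def compute_metrics(pred):
+         # Greedy argmax over the CTC logits, then compare decoded text against references
+         pred_ids = np.argmax(pred.predictions, axis=-1)
+         label_ids = np.where(pred.label_ids == -100, tokenizer.pad_token_id, pred.label_ids)
+         pred_str = tokenizer.batch_decode(pred_ids)
+         label_str = tokenizer.batch_decode(label_ids, group_tokens=False)
+         return {"wer": wer_metric.compute(predictions=pred_str, references=label_str)}
+
+     args = TrainingArguments(
+         output_dir="tone-finetuned",
+         per_device_train_batch_size=16,  # illustrative values, not the original training settings
+         learning_rate=1e-4,
+         num_train_epochs=5,
+         fp16=True,
+         eval_strategy="epoch",
+     )
+     return Trainer(
+         model=model,
+         args=args,
+         data_collator=data_collator,
+         train_dataset=train_dataset,
+         eval_dataset=eval_dataset,
+         compute_metrics=compute_metrics,
+     )
+ ```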
115
+
116
+ For a complete guide please refer to the [**fine-tuning example notebook**](https://github.com/voicekit-team/T-one/blob/main/examples/finetune_example.ipynb).
117
+
118
+ ## 🎙 Acoustic model
119
+
120
+ ### Architecture
121
+ T-one is a 70M parameter acoustic model based on the **Conformer** architecture, with several key innovations to improve performance and efficiency:
122
+ - **SwiGLU Activation:** The feed-forward module is replaced with a SwiGLU module for better performance.
123
+ - **Modern Normalization:** SiLU (Swish) activations and RMSNorm are used in place of ReLU and LayerNorm.
124
+ - **RoPE Embeddings:** Relative positional embeddings from Transformer-XL are replaced with faster Rotary Position Embeddings (RoPE).
125
+ - **U-Net Structure:** The temporal dimension is downsampled and then upsampled within the Conformer blocks, improving the model's receptive field.
126
+ - **Attention Score Reuse:** Multi-Head Self-Attention layers are grouped, and attention scores are computed only once per group to reduce computation.
127
+ - **Efficient State Management:** Streaming states are used only in the final two layers of the model.
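+
+ For intuition, a SwiGLU feed-forward block roughly follows the sketch below. The dimensions match `d_model: 384` and `ff_expansion_factor: 4` from `config.json`, but the exact layer layout inside T-one is simplified here; the module is only an illustration.
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+
+ class SwiGLUFeedForward(nn.Module):
+     """Gated feed-forward block: SiLU(x @ W_gate) * (x @ W_up), projected back to d_model."""
+
+     def __init__(self, d_model: int = 384, expansion_factor: int = 4):
+         super().__init__()
+         d_ff = d_model * expansion_factor
+         self.norm = nn.RMSNorm(d_model)  # RMSNorm instead of LayerNorm (requires torch >= 2.4)
+         self.gate_proj = nn.Linear(d_model, d_ff)
+         self.up_proj = nn.Linear(d_model, d_ff)
+         self.down_proj = nn.Linear(d_ff, d_model)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         h = self.norm(x)
+         return self.down_proj(F.silu(self.gate_proj(h)) * self.up_proj(h))
+
+
+ frames = torch.randn(1, 50, 384)  # (batch, time, d_model)
+ print(SwiGLUFeedForward()(frames).shape)  # torch.Size([1, 50, 384])
+ ```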
+
+ It processes audio in 300 ms chunks and generates transcriptions using either greedy decoding or a KenLM-based CTC beam search decoder.
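+
+ The greedy path is easy to sketch from the `config.json` in this repository: take the most likely symbol per frame, collapse repeats, then drop the blank. The blank id is assumed here to coincide with `pad_token_id` (34); the real pipeline wraps this logic together with the phrase splitter and the KenLM beam search.
+
+ ```python
+ import numpy as np
+
+ # Character set from decoder_params.vocabulary in config.json (ids 0-33)
+ VOCAB = list("абвгдеёжзийклмнопрстуфхцчшщъыьэюя ")
+ BLANK_ID = 34  # pad_token_id from config.json, assumed to double as the CTC blank
+
+
+ def greedy_ctc_decode(log_probs: np.ndarray) -> str:
+     """log_probs: (time, num_classes) frame-level scores from the acoustic model."""
+     ids = log_probs.argmax(axis=-1)
+     text, prev = [], BLANK_ID
+     for i in ids:
+         if i != prev and i != BLANK_ID:  # collapse repeats, then drop blanks
+             text.append(VOCAB[i])
+         prev = i
+     return "".join(text)
+
+
+ # Tiny fabricated example: six frames spelling "да" with repeats and blanks
+ frames = np.full((6, 35), -10.0)
+ for t, sym in enumerate([4, 4, 34, 0, 0, 34]):  # "д", "д", blank, "а", "а", blank
+     frames[t, sym] = 0.0
+ print(greedy_ctc_decode(frames))  # -> "да"
+ ```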
+
+ The model was trained with the CTC loss.
+ T-one is primarily intended for telephone-channel audio. However, since it was trained on heterogeneous data, it is robust across different domains and is not limited to telephony.
+ The model supports streaming inference, which means it can process long audio files out of the box in real time.
+ The primary use case for this model is streaming speech recognition of calls: the user sends small audio chunks to the model, which processes each segment incrementally and returns the finalized text and word-level timestamps in real time.
+ T-one can be easily fine-tuned for specific domains.
+
+ For a detailed exploration of our architecture, design choices, and implementation, check out our accompanying article (link will be shared shortly). Also see our **technical deep dive** on improving the quality and training speed of a streaming ASR model on [**YouTube**](https://www.youtube.com/watch?v=OQD9o1MdFRE).
+
+ ## 📉 Training details
+
+ ### Training Data
+ The acoustic model was trained on over 80,000 hours of Russian speech. A significant portion (up to 64%) was pseudo-labeled using a robust ROVER model ensemble.
+
+ | Domain | Hours | Source |
+ |:------------|:--------|:------------|
+ | Telephony | 57.9K | internal |
+ | Far-field | 2.2K | internal |
+ | Mix | 18.4K | internal |
+ | Mix | 2.3K | open-source |
+
+ ### Training Procedure
+ The model was trained from scratch (random initialization) for 7 days on 8 A100 GPUs using the **NVIDIA NeMo** framework. Key training parameters are listed below, followed by a small illustrative sketch:
+ - **Optimizer:** AdamW
+ - **Scheduler:** Cosine annealing with warmup
+ - **Precision:** 16-bit mixed precision
+ - **Batching:** Semi-sorted batching for efficiency
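+
+ A minimal sketch of the optimizer/scheduler combination; the stand-in model and all values are illustrative, not the actual training hyperparameters:
+
+ ```python
+ import torch
+ from transformers import get_cosine_schedule_with_warmup
+
+ model = torch.nn.Linear(64, 35)  # stand-in for the acoustic model
+
+ optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
+ scheduler = get_cosine_schedule_with_warmup(
+     optimizer,
+     num_warmup_steps=1_000,      # linear warmup
+     num_training_steps=100_000,  # cosine decay over the remainder of training
+ )
+
+ # Inside the training loop: optimizer.step(); scheduler.step(); optimizer.zero_grad()
+ ```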
+
+ ## 📜 License
+
+ This project, including the code and pretrained models, is released under the **Apache 2.0 License**.
added_tokens.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "</s>": 36,
+ "<s>": 35,
+ "<unk>": 37
+ }
config.json ADDED
@@ -0,0 +1,117 @@
+ {
+ "architectures": [
+ "ToneForCTC"
+ ],
+ "ctc_loss_reduction": "mean",
+ "ctc_zero_infinity": true,
+ "decoder_params": {
+ "feat_in": 384,
+ "vocabulary": [
+ "а",
+ "б",
+ "в",
+ "г",
+ "д",
+ "е",
+ "ё",
+ "ж",
+ "з",
+ "и",
+ "й",
+ "к",
+ "л",
+ "м",
+ "н",
+ "о",
+ "п",
+ "р",
+ "с",
+ "т",
+ "у",
+ "ф",
+ "х",
+ "ц",
+ "ч",
+ "ш",
+ "щ",
+ "ъ",
+ "ы",
+ "ь",
+ "э",
+ "ю",
+ "я",
+ " "
+ ]
+ },
+ "encoder_params": {
+ "chunk_size": 10,
+ "conv_kernel_size": 31,
+ "d_model": 384,
+ "dropout": 0.1,
+ "dropout_att": 0.1,
+ "feat_in": 64,
+ "ff_expansion_factor": 4,
+ "mhsa_state_size": 30,
+ "mhsa_stateless_layers": 14,
+ "n_heads": 8,
+ "n_layers": 16,
+ "reduction_factor": 2,
+ "reduction_kernel_size": 3,
+ "reduction_position": 6,
+ "rope_dim": 32,
+ "should_recompute_att_scores": [
+ true,
+ false,
+ false,
+ false,
+ false,
+ false,
+ false,
+ true,
+ false,
+ false,
+ false,
+ false,
+ false,
+ false,
+ true,
+ true
+ ],
+ "subsampling_conv_channels": [
+ 32,
+ 64
+ ],
+ "subsampling_kernel_size": [
+ [
+ 11,
+ 21
+ ],
+ [
+ 11,
+ 11
+ ]
+ ],
+ "subsampling_strides": [
+ [
+ 1,
+ 1
+ ],
+ [
+ 3,
+ 1
+ ]
+ ],
+ "upsample_position": 14
+ },
+ "feature_extraction_params": {
+ "n_fft": 160,
+ "n_mels": 64,
+ "preemphasis_coefficient": 0.97,
+ "sample_rate": 8000,
+ "window_size": 0.02,
+ "window_stride": 0.01
+ },
+ "pad_token_id": 34,
+ "torch_dtype": "float32",
+ "transformers_version": "4.41.2"
+ }
kenlm.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8c31a489a51a6e9236112dacb6bed12f45e8df734057615fa6bf220a5a769a1d
+ size 5463477004
model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fc9b82db8419044430557e117a2669056b41c80cde907f01c78dab84333acb2f
+ size 144199888
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:60f42e72314cd132a25a770913a6c55ebd44fb1a665d0e2371267551eab08007
+ size 286858388
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+ "bos_token": "<s>",
+ "eos_token": "</s>",
+ "pad_token": "[PAD]",
+ "unk_token": "<unk>"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,47 @@
+ {
+ "added_tokens_decoder": {
+ "34": {
+ "content": "[PAD]",
+ "lstrip": true,
+ "normalized": false,
+ "rstrip": true,
+ "single_word": false,
+ "special": false
+ },
+ "35": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "36": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "37": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "bos_token": "<s>",
+ "clean_up_tokenization_spaces": true,
+ "do_lower_case": false,
+ "eos_token": "</s>",
+ "model_max_length": 1000000000000000019884624838656,
+ "pad_token": "[PAD]",
+ "replace_word_delimiter_char": " ",
+ "target_lang": null,
+ "tokenizer_class": "Wav2Vec2CTCTokenizer",
+ "unk_token": "<unk>",
+ "word_delimiter_token": "|"
+ }
vocab.json ADDED
@@ -0,0 +1,37 @@
+ {
+ "[PAD]": 34,
+ "|": 33,
+ "а": 0,
+ "б": 1,
+ "в": 2,
+ "г": 3,
+ "д": 4,
+ "е": 5,
+ "ж": 7,
+ "з": 8,
+ "и": 9,
+ "й": 10,
+ "к": 11,
+ "л": 12,
+ "м": 13,
+ "н": 14,
+ "о": 15,
+ "п": 16,
+ "р": 17,
+ "с": 18,
+ "т": 19,
+ "у": 20,
+ "ф": 21,
+ "х": 22,
+ "ц": 23,
+ "ч": 24,
+ "ш": 25,
+ "щ": 26,
+ "ъ": 27,
+ "ы": 28,
+ "ь": 29,
+ "э": 30,
+ "ю": 31,
+ "я": 32,
+ "ё": 6
+ }
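
These tokenizer files (vocab.json, added_tokens.json, special_tokens_map.json, tokenizer_config.json) follow the layout of the `Wav2Vec2CTCTokenizer` class named in `tokenizer_config.json`. A minimal loading sketch, assuming the repository can be consumed through the standard 🤗 Transformers tokenizer API:

```python
from transformers import Wav2Vec2CTCTokenizer

# Character-level tokenizer built from vocab.json; "|" (id 33) is the word delimiter
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("t-tech/T-one")

ids = tokenizer("привет мир").input_ids
print(ids)
print(tokenizer.decode(ids))  # word delimiters are rendered back as spaces
```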