nguyenvulebinh committed ae29b16 (verified) · Parent(s): 67bfcfe

Update README.md

Files changed (1): README.md (+259, -167)
---
library_name: transformers
tags:
- automatic-speech-recognition
- audio-visual-speech-recognition
- multimodal
- speech-recognition
- lip-reading
- cocktail-party
- noise-robust
- av-hubert
- transformer
- pytorch
- audio
- video
- english
- lrs2
- voxceleb2
- ctc
- attention
- beam-search
- multi-speaker
- noisy-speech
datasets:
- nguyenvulebinh/AVYT
language:
- en
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# AVSRCocktail: Audio-Visual Speech Recognition for Cocktail Party Scenarios

**Official implementation** of "[Cocktail-Party Audio-Visual Speech Recognition](https://arxiv.org/abs/2506.02178)" (Interspeech 2025).

AVSRCocktail is a robust audio-visual speech recognition system designed for multi-speaker environments and noisy cocktail-party scenarios. The model combines lip reading with audio processing to remain accurate under challenging acoustic conditions with background noise and interfering speakers.

## Getting Started

### Sections
1. <a href="#install">Installation</a>
2. <a href="#evaluation">Evaluation</a>
3. <a href="#training">Training</a>

## <a id="install">1. Installation</a>

Follow these steps:

```sh
# Clone the baseline code repo
git clone https://github.com/nguyenvulebinh/AVSRCocktail.git
cd AVSRCocktail

# Create a Conda environment
conda create --name AVSRCocktail python=3.11
conda activate AVSRCocktail

# Install FFmpeg if it is not already installed
conda install ffmpeg

# Install dependencies
pip install -r requirements.txt
```

## <a id="evaluation">2. Evaluation</a>

The evaluation script `script/evaluation.py` evaluates the AVSR Cocktail model on multiple datasets under various noise conditions and interference scenarios.

### Quick Start

**Basic evaluation on the LRS2 test set:**
```sh
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test
```

**Evaluation on the AVCocktail dataset:**
```sh
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0
```

### Supported Datasets

#### 1. LRS2 Dataset
Evaluate on the LRS2 dataset with various noise conditions:

**Available test sets:**
- `test`: Clean test set
- `test_snr_n5_interferer_1`: SNR -5 dB with 1 interferer
- `test_snr_n5_interferer_2`: SNR -5 dB with 2 interferers
- `test_snr_0_interferer_1`: SNR 0 dB with 1 interferer
- `test_snr_0_interferer_2`: SNR 0 dB with 2 interferers
- `test_snr_5_interferer_1`: SNR 5 dB with 1 interferer
- `test_snr_5_interferer_2`: SNR 5 dB with 2 interferers
- `test_snr_10_interferer_1`: SNR 10 dB with 1 interferer
- `test_snr_10_interferer_2`: SNR 10 dB with 2 interferers
- `*`: Evaluate on all test sets and report average WER

**Example:**
```sh
# Evaluate on the clean test set
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test

# Evaluate on a noisy condition
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test_snr_0_interferer_1

# Evaluate on all conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id "*"
```
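
If you want per-condition logs instead of the aggregated `--set_id "*"` run, the test sets can also be scripted individually. The sketch below is a hypothetical helper that simply shells out to `script/evaluation.py` once per condition; it assumes it is run from the repository root and does not parse the reported WER.

```python
# Hypothetical helper: run script/evaluation.py once per LRS2 condition.
import subprocess

SET_IDS = ["test"] + [
    f"test_snr_{snr}_interferer_{n}"
    for snr in ("n5", "0", "5", "10")   # "n5" = -5 dB
    for n in (1, 2)
]

for set_id in SET_IDS:
    print(f"=== Evaluating LRS2 set: {set_id} ===")
    subprocess.run(
        [
            "python", "script/evaluation.py",
            "--model_type", "avsr_cocktail",
            "--dataset_name", "lrs2",
            "--set_id", set_id,
        ],
        check=True,
    )
```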

#### 2. AVCocktail Dataset
Evaluate on the AVCocktail cocktail-party dataset:

**Available video sets:**
- `video_0` to `video_50`: Individual video sessions
- `*`: Evaluate on all video sessions and report average WER

The evaluation reports WER for three different chunking strategies:
- `asd_chunk`: Chunks based on Active Speaker Detection
- `fixed_chunk`: Fixed-duration chunks
- `gold_chunk`: Chunks from ground-truth segment boundaries

**Example:**
```sh
# Evaluate on a specific video
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0

# Evaluate on all videos
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id "*"
```
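
For intuition on how `fixed_chunk` differs from the other strategies: it simply slices a session into equal-length windows, whereas `asd_chunk` and `gold_chunk` use Active Speaker Detection and ground-truth boundaries. The snippet below is an illustrative fixed-window splitter, not the repository's implementation; the 15-second window mirrors the default `--max_length`.

```python
# Illustrative fixed-duration chunking (not the repo's implementation).
def fixed_chunks(duration_s: float, window_s: float = 15.0):
    """Split [0, duration_s) into consecutive windows of at most window_s seconds."""
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks

# Example: a 100-second session becomes 7 segments, the last one 10 s long.
print(fixed_chunks(100.0))
```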

### Configuration Options

#### Model Configuration
- `--model_type`: Model architecture to use (use `avsr_cocktail` for the AVSR Cocktail model)
- `--checkpoint_path`: Path to a custom model checkpoint (default: uses the pretrained `nguyenvulebinh/AVSRCocktail`)
- `--cache_dir`: Directory to cache downloaded models (default: `./model-bin`)

#### Processing Parameters
- `--max_length`: Maximum length of video segments in seconds (default: 15)
- `--beam_size`: Beam size for beam search decoding (default: 3)

#### Dataset Parameters
- `--dataset_name`: Dataset to evaluate on (`lrs2` or `AVCocktail`)
- `--set_id`: Specific subset to evaluate (see the dataset-specific options above)

#### Output Options
- `--verbose`: Enable verbose output during processing
- `--output_dir_name`: Name of the output directory for session processing (default: `output`)

### Advanced Usage

**Custom model checkpoint:**
```sh
python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name lrs2 \
    --set_id test \
    --checkpoint_path ./model-bin/my_custom_model \
    --cache_dir ./custom_cache
```

**Optimized inference settings:**
```sh
python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name AVCocktail \
    --set_id "*" \
    --max_length 10 \
    --beam_size 5 \
    --verbose
```

### Output Format

The evaluation script outputs Word Error Rate (WER) scores.

**LRS2 evaluation output:**
```
WER test: 0.1234
```

**AVCocktail evaluation output:**
```
WER video_0 asd_chunk: 0.1234
WER video_0 fixed_chunk: 0.1456
WER video_0 gold_chunk: 0.1123
```

When using `--set_id "*"`, the script reports both individual and average WER scores across all test conditions.
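
For reference, WER is the word-level edit distance divided by the number of reference words, reported here as a fraction (e.g., 0.1234 ≈ 12.34%). If you want to sanity-check scores on your own transcript pairs, the `jiwer` package is one option; this is an illustrative assumption, not necessarily what `script/evaluation.py` uses internally.

```python
# Minimal WER check with jiwer (pip install jiwer); shown for illustration only.
import jiwer

references = ["the cocktail party effect is hard", "turn the music down please"]
hypotheses = ["the cocktail party effect is hard", "turn the music down"]

# Corpus-level WER over all pairs: 1 deletion / 11 reference words ≈ 0.09.
print(jiwer.wer(references, hypotheses))
```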

## <a id="training">3. Training</a>

### Model Architecture

- **Encoder**: Pre-trained AV-HuBERT large model (`nguyenvulebinh/avhubert_encoder_large_noise_pt_noise_ft_433h`)
- **Decoder**: Transformer decoder trained with a joint CTC/attention objective
- **Tokenization**: SentencePiece unigram tokenizer with a 5,000-unit vocabulary
- **Input**: Video frames are cropped to a 96 × 96 mouth region of interest; audio is sampled at 16 kHz
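
To make the input specification concrete, the sketch below builds dummy tensors with the shapes described above. The 25 fps video rate is an assumption (the usual AV-HuBERT setting) rather than something stated here; the actual preprocessing lives in the repository code.

```python
# Dummy audio-visual inputs matching the spec above (shapes are illustrative).
import torch

FPS = 25               # assumed video frame rate (typical AV-HuBERT setting)
SAMPLE_RATE = 16_000   # audio sampling rate stated above
SECONDS = 2

# Mouth-crop frames: (batch, frames, height, width)
video = torch.randn(1, SECONDS * FPS, 96, 96)
# Raw waveform at 16 kHz: (batch, samples)
audio = torch.randn(1, SECONDS * SAMPLE_RATE)

# Each video frame spans 16000 / 25 = 640 audio samples.
print(video.shape, audio.shape, SAMPLE_RATE // FPS)
```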

### Training Data

The model is trained on multiple large-scale datasets that have been preprocessed and are ready for the training pipeline. All datasets are hosted on Hugging Face at [nguyenvulebinh/AVYT](https://huggingface.co/datasets/nguyenvulebinh/AVYT) and include:

| Dataset | Size |
|---------|------|
| **LRS2** | ~145k samples |
| **VoxCeleb2** | ~540k samples |
| **AVYT** | ~717k samples |
| **AVYT-mix** | ~483k samples |

Details about these datasets can be found in the [Cocktail-Party Audio-Visual Speech Recognition](https://arxiv.org/abs/2506.02178) paper.

**Dataset Features:**
- **Preprocessed**: All audio-visual data is preprocessed and ready for direct input to the training pipeline
- **Multi-modal**: Each sample contains synchronized audio and video (mouth crop) data
- **Labeled**: Text transcriptions are provided for supervised learning

The training pipeline handles dataset loading automatically and supports [streaming mode](https://huggingface.co/docs/datasets/stream). However, to make training faster and more stable, it is recommended to download all datasets before running the training pipeline. Storing all datasets requires approximately 1.46 TB.
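
To preview the data without downloading the full 1.46 TB, the corpus can be opened in streaming mode with 🤗 `datasets`. The split name below is an assumption for illustration; check the [dataset card](https://huggingface.co/datasets/nguyenvulebinh/AVYT) for the actual configuration and split layout.

```python
# Peek at one sample in streaming mode (no full download); split names may differ.
from datasets import load_dataset

stream = load_dataset("nguyenvulebinh/AVYT", split="train", streaming=True)
sample = next(iter(stream))
print(sample.keys())
```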

### Training Process

The training script is available at `script/train.py`.

**Multi-GPU Distributed Training:**
```sh
# Set environment variables for distributed training
export NCCL_DEBUG=WARN
export OMP_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Run with torchrun for multi-GPU training (using default parameters)
torchrun --nproc_per_node 4 script/train.py

# Run with custom parameters
torchrun --nproc_per_node 4 script/train.py \
    --streaming_dataset \
    --batch_size 6 \
    --max_steps 400000 \
    --gradient_accumulation_steps 2 \
    --save_steps 2000 \
    --eval_steps 2000 \
    --learning_rate 1e-4 \
    --warmup_steps 4000 \
    --checkpoint_name avsr_avhubert_ctcattn \
    --model_name_or_path ./model-bin/avsr_cocktail \
    --output_dir ./model-bin
```
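
These flags interact: the effective (total) batch size is `nproc_per_node × batch_size × gradient_accumulation_steps`. A quick check with the numbers used in this README:

```python
# Effective batch size for the torchrun command above.
nproc_per_node = 4                 # GPUs in the example command
batch_size = 6                     # per-device batch size
gradient_accumulation_steps = 2

print(nproc_per_node * batch_size * gradient_accumulation_steps)  # 48

# The hardware note below assumes 2 GPUs: 2 * 6 * 2 = 24.
print(2 * batch_size * gradient_accumulation_steps)               # 24
```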

**Model Output:**
The trained model is saved by default to `model-bin/{checkpoint_name}/` (default: `model-bin/avsr_avhubert_ctcattn/`).

#### Configuration Options

You can customize training parameters using command-line arguments:

**Dataset Options:**
- `--streaming_dataset`: Use streaming mode for datasets (default: False)

**Training Parameters:**
- `--batch_size`: Batch size per device (default: 6)
- `--max_steps`: Total training steps (default: 400000)
- `--learning_rate`: Initial learning rate (default: 1e-4)
- `--warmup_steps`: Learning rate warmup steps (default: 4000)
- `--gradient_accumulation_steps`: Gradient accumulation steps (default: 2)

**Checkpoint and Logging:**
- `--save_steps`: Checkpoint saving frequency (default: 2000)
- `--eval_steps`: Evaluation frequency (default: 2000)
- `--log_interval`: Logging frequency (default: 25)
- `--checkpoint_name`: Name of the checkpoint directory (default: "avsr_avhubert_ctcattn")
- `--resume_from_checkpoint`: Resume training from the last checkpoint (default: False)

**Model and Output:**
- `--model_name_or_path`: Path to the pretrained model (default: "./model-bin/avsr_cocktail")
- `--output_dir`: Output directory for checkpoints (default: "./model-bin")
- `--report_to`: Logging backend, "wandb" or "none" (default: "none")

**Hardware Requirements:**
- **GPU memory**: The default training configuration is designed to fit within **24 GB** of GPU memory
- **Training time**: With 2x NVIDIA Titan RTX 24 GB GPUs, training takes approximately **56 hours per epoch**
- **Convergence**: **200,000 steps** at a total batch size of 24 (6 per device × 2 gradient accumulation × 2 GPUs) is typically sufficient for convergence

## Acknowledgement

This repository is built using the [auto_avsr](https://github.com/mpc001/auto_avsr), [espnet](https://github.com/espnet/espnet), and [avhubert](https://github.com/facebookresearch/av_hubert) repositories.

## Contact