---
language:
- "en"
- "zh"
pretty_name: "Easy Turn"
tags:
- speech
- asr
license: "apache-2.0"
task_categories:
- automatic-speech-recognition
- audio-classification
---
# Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

<p align="center">

</div>

<p align="center">
<img src="src/logo.png" alt="Institution 5" style="width: 600px; border-radius: 30px;">
</p>

## Download
The Easy Turn resources are available at [Model](https://huggingface.co/ASLP-lab/Easy-Turn), [Trainset](https://huggingface.co/datasets/ASLP-lab/Easy-Turn-Trainset), and [Testset](https://huggingface.co/datasets/ASLP-lab/Easy-Turn-Testset).

## Experiments
### Main Results
We evaluate **Easy Turn** against two open-source turn-taking detection models, TEN Turn Detection and Smart Turn V2, on the **Easy Turn testset**. All experiments are conducted on a single NVIDIA RTX 4090 GPU. Because TEN Turn Detection does not accept speech input directly, we use Paraformer as the ASR model to transcribe speech into text and feed the transcript to it. The table below reports the results: ACC_cp, ACC_incp, ACC_bc, and ACC_wait denote turn-taking detection accuracy for the *complete*, *incomplete*, *backchannel*, and *wait* states (higher is better); Params, Latency, and Memory denote total model size, average inference time, and GPU memory usage (lower is better).
 
| Model | Params (MB) ↓ | Latency (ms) ↓ | Memory (MB) ↓ | ACC_cp (%) ↑ | ACC_incp (%) ↑ | ACC_bc (%) ↑ | ACC_wait (%) ↑ |
|-------------------------------|---------------|--------------|-------------|--------------|--------------|------------|--------------|
| [Paraformer](https://github.com/modelscope/FunASR) + [TEN Turn Detection](https://github.com/ten-framework/ten-turn-detection) | 7220 | 204 | 15419 | 86.67 | 89.3 | – | 91 |
| [Smart Turn V2](https://github.com/pipecat-ai/smart-turn) | **95** | **27** | **370** | 78.67 | 62 | – | – |
| **Easy Turn (Proposed)** | 850 | 263 | 2559 | **96.33** | **97.67** | **91** | **98** |
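For clarity, the ACC_* columns are plain per-state accuracies. A small Python sketch of the computation (the helper name and toy data are illustrative, not the authors' evaluation code):

```python
from collections import Counter

def per_state_accuracy(refs, hyps):
    """Per-state accuracy: of all utterances whose reference state is s,
    the fraction that the detector also labeled s."""
    total, correct = Counter(), Counter()
    for ref, hyp in zip(refs, hyps):
        total[ref] += 1
        correct[ref] += int(ref == hyp)
    return {state: correct[state] / total[state] for state in total}

# toy example: two complete, one incomplete, one wait utterance
refs = ["<COMPLETE>", "<COMPLETE>", "<INCOMPLETE>", "<WAIT>"]
hyps = ["<COMPLETE>", "<INCOMPLETE>", "<INCOMPLETE>", "<WAIT>"]
acc = per_state_accuracy(refs, hyps)
```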
 
### Examples
We present several examples of Easy Turn in spoken dialogue systems. The content in angle brackets is the dialogue turn state detected by Easy Turn, and the text in parentheses is the action the system should take given that state. To evaluate turn-taking detection in practice, we deploy Easy Turn in our laboratory spoken dialogue system [OSUM-EChat](https://github.com/ASLP-lab/OSUM), where human users interact with the system through microphone input. Easy Turn accurately identifies dialogue turn states and enables the system to respond appropriately. For a live demonstration, see our [Demo Page](https://aslp-lab.github.io/Easy-Turn/).
<div align="center"><img width="550px" src="src/examples.jpg" /></div>
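A dialogue system consuming Easy Turn outputs needs some state-to-action policy of the kind pictured above. The mapping below is a hypothetical sketch (the action strings are ours; the actual policy is application-specific):

```python
# Hypothetical state -> action policy; the four states come from Easy Turn,
# the actions are illustrative placeholders, not part of the repository.
ACTIONS = {
    "<COMPLETE>": "respond",
    "<INCOMPLETE>": "keep listening",
    "<BACKCHANNEL>": "continue current response",
    "<WAIT>": "pause the response",
}

def act_on(state: str) -> str:
    # unknown states default to not interrupting the user
    return ACTIONS.get(state, "keep listening")
```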
 
## Environment
Follow the steps below to clone the repository and set up the environment.
```bash
# clone and enter the repository

conda activate easy-turn

# install requirements
pip install -r requirements.txt
```
## Training
### Data Types

This project supports two types of data: **raw** and **shard**.

#### Raw Type

Data is stored in **jsonl** format, one JSON object per line, with the following fields:

```
{
  "task": "<TRANSCRIBE> <BACKCHANNEL> <COMPLETE>",  # fixed, or see conf/prompt.yaml
  "key": "complete_0001",                           # required
  "wav": "./complete_0001.wav",                     # required
  "txt": "你有没有发生过一些童年趣事呀?<COMPLETE>",     # required; the transcript must end with one of the four labels (<COMPLETE>, <INCOMPLETE>, <BACKCHANNEL>, <WAIT>)
  "lang": "<CN>",
  "speaker": "G00000007",                           # optional; may be <NONE>
  "emotion": "<NONE>",                              # optional; may be <NONE>
  "gender": "female",                               # optional; may be <NONE>
  "duration": 3.256,                                # optional; may be 0
  "state": "0",                                     # optional; may be 0
  "extra": {"dataset": "magicdata_ramc"}            # optional; may be empty
}
```
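As a sanity check on the format above, a small Python sketch (the helper is ours, not part of the repository) that builds and validates one raw-type entry:

```python
import json

# Required fields and turn-state labels, as listed in the spec above.
REQUIRED = ("task", "key", "wav", "txt")
LABELS = ("<COMPLETE>", "<INCOMPLETE>", "<BACKCHANNEL>", "<WAIT>")

def is_valid_entry(entry: dict) -> bool:
    """True if the required fields are present and the transcript
    ends with one of the four turn-state labels."""
    return all(k in entry for k in REQUIRED) and entry["txt"].endswith(LABELS)

entry = {
    "task": "<TRANSCRIBE> <BACKCHANNEL> <COMPLETE>",
    "key": "complete_0001",
    "wav": "./complete_0001.wav",
    "txt": "你有没有发生过一些童年趣事呀?<COMPLETE>",
}
line = json.dumps(entry, ensure_ascii=False)  # one JSON object per line
```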

Example:

```
./examples/wenetspeech/whisper/data/raw.list
```

#### Shard Type

Data is packed into **tar files**, storing multiple entries together for efficient bulk loading.

Example:

```
./examples/wenetspeech/whisper/data/shards_list.txt
```

Conversion script (from raw type):

```shell
./examples/wenetspeech/whisper/do_shard/shard_data.sh
```
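For illustration only, one way such a tar shard could be assembled in Python; the member naming here is our assumption, and the authoritative layout is whatever `shard_data.sh` produces:

```python
import io
import json
import tarfile

def write_shard(entries, shard_path):
    """Pack raw-type entries into one tar shard (illustrative layout:
    one <key>.json member per entry; audio would be added the same way)."""
    with tarfile.open(shard_path, "w") as tar:
        for entry in entries:
            payload = json.dumps(entry, ensure_ascii=False).encode("utf-8")
            info = tarfile.TarInfo(name=f"{entry['key']}.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

write_shard([{"key": "complete_0001", "txt": "示例<COMPLETE>"}], "shard_000.tar")
```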

### Start Training
Set stage=0 and stop_stage=0 to run model training. After training, set stage=1 and stop_stage=1 to run model merging. See the shell script for details.

```shell
./examples/wenetspeech/whisper/run.sh
```

## Inference
Please first download the Easy Turn checkpoint from [Easy Turn](https://huggingface.co/ASLP-lab/Easy-Turn).
```bash
dir=./examples/wenetspeech/whisper/exp/interrupt # local path where the model is stored; run model merging first
gpu_id=6 # single-GPU inference
test_data_dir='data' # top-level directory of the test sets
test_sets='interrupt_test' # subdirectory of the test set
ckpt_name=epoch_0.pt # checkpoint file name
task='<TRANSCRIBE><BACKCHANNEL><COMPLETE>' # task name; see conf/prompt.yaml for details
data_type='shard_full_data' # choose 'raw' or 'shard_full_data'; must match the type used in training

bash decode/decode_common.sh \
  --data_type $data_type \
  --dir $dir \
  --ckpt_name $ckpt_name \
  --task "$task"
```
 
 
## Citation
Please cite our paper if you find this work useful: