Update README.md
---
language:
- "en"
- "zh"
pretty_name: "Easy Turn"
tags:
- speech
- asr
license: "apache-2.0"
task_categories:
- automatic-speech-recognition
- audio-classification
---

# Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

<p align="center">

</div>

<p align="center">
    <img src="src/logo.png" alt="Institution 5" style="width: 600px; border-radius: 30px;">
</p>

## Download
The Easy Turn resources are available at [Model](https://huggingface.co/ASLP-lab/Easy-Turn), [Trainset](https://huggingface.co/datasets/ASLP-lab/Easy-Turn-Trainset), and [Testset](https://huggingface.co/datasets/ASLP-lab/Easy-Turn-Testset).

## EXPERIMENTS
### Main Results
We evaluate **Easy Turn** against two open-source turn-taking detection models, TEN Turn Detection and Smart Turn V2, on the **Easy Turn testset**. All experiments are conducted on a single NVIDIA RTX 4090 GPU. Because TEN Turn Detection does not accept speech input directly, we use Paraformer as the ASR model to transcribe the speech and feed it the transcripts. The table below reports the results: ACC_cp, ACC_incp, ACC_bc, and ACC_wait denote the turn-taking detection accuracy for the *complete*, *incomplete*, *backchannel*, and *wait* states (higher is better). Params, Latency, and Memory denote total model size, average inference time, and GPU memory usage (lower is better).

| Model                         | Params (MB) ↓ | Latency (ms) ↓ | Memory (MB) ↓ | ACC_cp (%) ↑ | ACC_incp (%) ↑ | ACC_bc (%) ↑ | ACC_wait (%) ↑ |
|-------------------------------|---------------|----------------|---------------|--------------|----------------|--------------|----------------|
| [Paraformer](https://github.com/modelscope/FunASR) + [TEN Turn Detection](https://github.com/ten-framework/ten-turn-detection) | 7220          | 204            | 15419         | 86.67        | 89.3           | –            | 91             |
| [Smart Turn V2](https://github.com/pipecat-ai/smart-turn)                 | **95**        | **27**         | **370**       | 78.67        | 62             | –            | –              |
| **Easy Turn (Proposed)**          | 850           | 263            | 2559          | **96.33**    | **97.67**      | **91**       | **98**         |

### Examples
We present several examples of Easy Turn in spoken dialogue systems. The content inside angle brackets is the dialogue turn state detected by Easy Turn, and the text in parentheses is the action the system should take given that state. To evaluate turn-taking detection in practice, we deploy Easy Turn in our laboratory spoken dialogue system [OSUM-EChat](https://github.com/ASLP-lab/OSUM), where human users interact with the system through microphone input. Easy Turn performs effectively, accurately identifying dialogue turn states and enabling the system to respond appropriately. For a live demonstration, see our [Demo Page](https://aslp-lab.github.io/Easy-Turn/).
<div align="center"><img width="550px" src="src/examples.jpg" /></div>
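As a rough illustration of the state-to-action idea described above, a dialogue system might dispatch on the detected state roughly as follows. This is a hypothetical sketch: the four state tags are from the Easy Turn label set, but the action names are illustrative and not OSUM-EChat's actual interface.

```python
# Hypothetical sketch: dispatch on the turn state detected by Easy Turn.
# The four tags are the Easy Turn label set; the action strings are
# illustrative, not the real OSUM-EChat API.
TURN_ACTIONS = {
    "<COMPLETE>": "respond",               # user finished; generate a reply
    "<INCOMPLETE>": "keep_listening",      # user paused mid-utterance; wait
    "<BACKCHANNEL>": "continue_speaking",  # "uh-huh" etc.; do not stop
    "<WAIT>": "stop_speaking",             # user asked the system to hold on
}

def act_on_turn_state(state_tag: str) -> str:
    """Map a detected turn-state tag to a dialogue-system action."""
    try:
        return TURN_ACTIONS[state_tag]
    except KeyError:
        raise ValueError(f"unknown turn state: {state_tag}")
```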
## Environment
Follow the steps below to clone the repository and install the environment.
```bash
# clone and enter the repository

conda activate easy-turn
## install requirements
pip install -r requirements.txt
```
## Training
### Data Types

This project supports two types of data: **raw** and **shard**.

#### **Raw Type**:

Data is stored in **jsonl** format, one JSON object per line, with the following fields:

```
{
"task": "<TRANSCRIBE> <BACKCHANNEL> <COMPLETE>",  # fixed, or see conf/prompt.yaml
"key": "complete_0001",  # required
"wav": "./complete_0001.wav",  # required
"txt": "你有没有发生过一些童年趣事呀?<COMPLETE>", # required; the transcript ends with one of four labels (<COMPLETE>, <INCOMPLETE>, <BACKCHANNEL>, <WAIT>)
"lang": "<CN>",
"speaker": "G00000007", # optional, may be <NONE>
"emotion": "<NONE>", # optional, may be <NONE>
"gender": "female", # optional, may be <NONE>
"duration": 3.256, # optional, may be 0
"state": "0", # optional, may be 0
"extra": {"dataset": "magicdata_ramc"} # optional, may be empty
}
```

Example:

```
./examples/wenetspeech/whisper/data/raw.list
```
#### **Shard Type**:

Data is packed into **tar files**, storing multiple entries together for efficient bulk loading.

Example:

```
./examples/wenetspeech/whisper/data/shards_list.txt
```

Conversion script (from raw type):

```shell
./examples/wenetspeech/whisper/do_shard/shard_data.sh
```
            +
            ### Start training
         
     | 
| 128 | 
         
            +
            Set stage = 0 and stop_stage = 0 for model training. After training, set stage = 1 and stop_stage = 1 for model merging. See the shell script for details.
         
     | 
| 129 | 
         
            +
             
     | 
| 130 | 
         
            +
            ```shell
         
     | 
| 131 | 
         
            +
            ./examples/wenetspeech/whisper/run.sh
         
     | 
| 132 | 
         
            +
            ```
         
     | 
| 133 | 
         | 
## Inference
Please first download the Easy Turn checkpoint from [Easy Turn](https://huggingface.co/ASLP-lab/Easy-Turn).
```bash
dir=./examples/wenetspeech/whisper/exp/interrupt  # local path to the model; merge the model first
gpu_id=6 # single-GPU inference
test_data_dir='data' # top-level directory of the test data
test_sets='interrupt_test' # test set subdirectory
ckpt_name=epoch_0.pt # checkpoint file name
task='<TRANSCRIBE><BACKCHANNEL><COMPLETE>' # task name; see conf/prompt.yaml
data_type='shard_full_data' # two options, raw or shard_full_data, same as in training

bash decode/decode_common.sh \
    --data_type $data_type \
    --dir $dir \
    --ckpt_name $ckpt_name \
    --task "$task"
```
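Since the decoded text ends with one of the four turn labels (matching the `txt` field format of the training data), a small helper can split a decoded string into transcript and turn state. This is an illustrative sketch, not part of the decode scripts.

```python
# The four turn labels, as used in the "txt" field of the training data.
TURN_LABELS = ("<COMPLETE>", "<INCOMPLETE>", "<BACKCHANNEL>", "<WAIT>")

def split_output(text):
    """Split a decoded string into (transcript, turn label).

    Assumes the output format matches the training-data "txt" field:
    transcript followed by exactly one trailing turn label. Returns
    (text, None) when no label is found.
    """
    for label in TURN_LABELS:
        if text.endswith(label):
            return text[: -len(label)], label
    return text, None
```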
## Citation
Please cite our paper if you find this work useful: