Update README.md
    	
        README.md
    CHANGED
    
    | @@ -3,4 +3,67 @@ license: apache-2.0 | |
| 3 | 
             
            language:
         | 
| 4 | 
             
            - en
         | 
| 5 | 
             
            ---
         | 
| 6 | 
            -
            [](https://arxiv.org/abs/2306.01533)
         | 
| 3 | 
             
            language:
         | 
| 4 | 
             
            - en
         | 
| 5 | 
             
            ---
         | 
| 6 | 
            +
            [](https://arxiv.org/abs/2306.01533)
         | 
| 7 | 
            +
             | 
| 8 | 
            +
            # Usage
         | 
| 9 | 
            +
            ```python
         | 
| 10 | 
            +
            import torch
         | 
| 11 | 
            +
            from transformers import AutoModel, PreTrainedTokenizerFast
         | 
| 12 | 
            +
            import torchaudio
         | 
| 13 | 
            +
             | 
| 14 | 
            +
             | 
| 15 | 
            +
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
         | 
| 16 | 
            +
             | 
| 17 | 
            +
            model = AutoModel.from_pretrained(
         | 
| 18 | 
            +
                "wsntxxn/cnn14rnn-tempgru-audiocaps-captioning",
         | 
| 19 | 
            +
                trust_remote_code=True
         | 
| 20 | 
            +
            ).to(device)
         | 
| 21 | 
            +
            tokenizer = PreTrainedTokenizerFast.from_pretrained(
         | 
| 22 | 
            +
                "wsntxxn/audiocaps-simple-tokenizer"
         | 
| 23 | 
            +
            )
         | 
| 24 | 
            +
             | 
| 25 | 
            +
            wav, sr = torchaudio.load("/path/to/file.wav")
         | 
| 26 | 
            +
            wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
         | 
| 27 | 
            +
            if wav.size(0) > 1:
         | 
| 28 | 
            +
                wav = wav.mean(0).unsqueeze(0)
         | 
| 29 | 
            +
             | 
| 30 | 
            +
            with torch.no_grad():
         | 
| 31 | 
            +
                word_idxs = model(
         | 
| 32 | 
            +
                    audio=wav,
         | 
| 33 | 
            +
                    audio_length=[wav.size(1)],
         | 
| 34 | 
            +
                )
         | 
| 35 | 
            +
             | 
| 36 | 
            +
            caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
         | 
| 37 | 
            +
            print(caption)
         | 
| 38 | 
            +
            ```
         | 
| 39 | 
            +
            Called this way, without a `temporal_tag`, the model makes the description as specific as possible.
         | 
| 40 | 
            +
             | 
| 41 | 
            +
            You can also assign a temporal tag manually to control how specifically temporal relationships are described:
         | 
| 42 | 
            +
            ```python
         | 
| 43 | 
            +
            with torch.no_grad():
         | 
| 44 | 
            +
                word_idxs = model(
         | 
| 45 | 
            +
                    audio=wav,
         | 
| 46 | 
            +
                    audio_length=[wav.size(1)],
         | 
| 47 | 
            +
                    temporal_tag=[2],  # describe "sequential" if there are sequential events, otherwise use the most complex relationship
         | 
| 48 | 
            +
                )
         | 
| 49 | 
            +
            ```
         | 
| 50 | 
            +
            The temporal tag is defined as:
         | 
| 51 | 
            +
            |Temporal Tag|Definition|
         | 
| 52 | 
            +
            |----:|-----:|
         | 
| 53 | 
            +
            |0|Only 1 Event|
         | 
| 54 | 
            +
            |1|Simultaneous Events|
         | 
| 55 | 
            +
            |2|Sequential Events|
         | 
| 56 | 
            +
            |3|More Complex Events|
         | 
| 57 | 
            +
             
         | 
| 58 | 
            +
             | 
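The integer tags in the table above can be easier to read in calling code if they are given names. The following is a minimal sketch, not part of the model's API: `TemporalTag` and `make_temporal_tags` are hypothetical helpers introduced here for illustration, and the assumption that `temporal_tag` takes one integer per audio clip in the batch is inferred from the `temporal_tag=[2]` call shown earlier.

```python
from enum import IntEnum


class TemporalTag(IntEnum):
    """Hypothetical names for the integer temporal tags in the table above."""
    SINGLE_EVENT = 0   # only 1 event
    SIMULTANEOUS = 1   # simultaneous events
    SEQUENTIAL = 2     # sequential events
    COMPLEX = 3        # more complex events


def make_temporal_tags(tag, batch_size=1):
    """Build a list of plain ints, one per clip (assumed `temporal_tag` format)."""
    tag = TemporalTag(tag)  # raises ValueError for values outside 0..3
    return [int(tag)] * batch_size


print(make_temporal_tags(TemporalTag.SEQUENTIAL))  # prints [2]
```

Under these assumptions, the result could be passed as `temporal_tag=make_temporal_tags(TemporalTag.SEQUENTIAL)` in place of the literal `[2]`.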
| 59 | 
            +
            # Citation
         | 
| 60 | 
            +
            If you find the model useful, please cite this paper:
         | 
| 61 | 
            +
            ```BibTeX
         | 
| 62 | 
            +
            @inproceedings{xie2023enhance,
         | 
| 63 | 
            +
                author = {Zeyu Xie and Xuenan Xu and Mengyue Wu and Kai Yu},
         | 
| 64 | 
            +
                title = {Enhance Temporal Relations in Audio Captioning with Sound Event Detection},
         | 
| 65 | 
            +
                year = 2023,
         | 
| 66 | 
            +
                booktitle = {Proc. INTERSPEECH},
         | 
| 67 | 
            +
                pages = {4179--4183},
         | 
| 68 | 
            +
            }
         | 
| 69 | 
            +
            ```
         | 
