wsntxxn/cnn14rnn-tempgru-audiocaps-captioning


wsntxxn committed on Aug 19, 2024

Commit f405418 · verified · 1 Parent(s): 3a72e8a

Update README.md

Files changed (1)

README.md +64 -1

@@ -3,4 +3,67 @@ license: apache-2.0
 language:
 - en
 ---
-[![arXiv](https://img.shields.io/badge/arXiv-2306.01533-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2306.01533)

+[![arXiv](https://img.shields.io/badge/arXiv-2306.01533-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2306.01533)
+# Usage
+```python
+import torch
+from transformers import AutoModel, PreTrainedTokenizerFast
+import torchaudio
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model = AutoModel.from_pretrained(
+    "wsntxxn/cnn14rnn-tempgru-audiocaps-captioning",
+    trust_remote_code=True
+).to(device)
+tokenizer = PreTrainedTokenizerFast.from_pretrained(
+    "wsntxxn/audiocaps-simple-tokenizer"
+)
+wav, sr = torchaudio.load("/path/to/file.wav")
+wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
+if wav.size(0) > 1:
+    wav = wav.mean(0).unsqueeze(0)
+with torch.no_grad():
+    word_idxs = model(
+        audio=wav,
+        audio_length=[wav.size(1)],
+    )
+caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
+print(caption)
+```
+By default (no `temporal_tag`), the model describes temporal relationships as specifically as possible.
+You can also assign a temporal tag manually to control how specifically temporal relationships are described:
+```python
+with torch.no_grad():
+    word_idxs = model(
+        audio=wav,
+        audio_length=[wav.size(1)],
+        temporal_tag=[2],  # describe "sequential" if there are sequential events; otherwise use the most complex relationship
+    )
+```
+The temporal tag is defined as:
+|Temporal Tag|Definition|
+|----:|-----:|
+|0|Only 1 Event|
+|1|Simultaneous Events|
+|2|Sequential Events|
+|3|More Complex Events|
+# Citation
+If you find the model useful, please cite this paper:
+```BibTeX
+@inproceedings{xie2023enhance,
+    author = {Zeyu Xie and Xuenan Xu and Mengyue Wu and Kai Yu},
+    title = {Enhance Temporal Relations in Audio Captioning with Sound Event Detection},
+    year = 2023,
+    booktitle = {Proc. INTERSPEECH},
+    pages = {4179--4183},
+}
+```
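
The temporal tag table added to the README in this diff can be sketched as a small enum, so that call sites pass a named value instead of a bare integer. This is only an illustrative sketch, not part of the commit: `TemporalTag` and `make_temporal_tags` are hypothetical names, and repeating the tag once per clip is an assumption based on the list-valued `temporal_tag=[2]` argument in the usage example.

```python
from enum import IntEnum


class TemporalTag(IntEnum):
    """Temporal tag values as listed in the README table (hypothetical helper)."""

    SINGLE_EVENT = 0   # only 1 event
    SIMULTANEOUS = 1   # simultaneous events
    SEQUENTIAL = 2     # sequential events
    COMPLEX = 3        # more complex events


def make_temporal_tags(tag, batch_size=1):
    """Return a list holding one validated integer tag per clip.

    Assumption: the model expects one tag per batch item, mirroring
    `temporal_tag=[2]` for a single clip in the README example.
    """
    value = int(TemporalTag(tag))  # raises ValueError for values outside 0..3
    return [value] * batch_size
```

Under these assumptions, `make_temporal_tags(TemporalTag.SEQUENTIAL)` produces the same `[2]` that the README passes as `temporal_tag`, and an out-of-range value fails early with a `ValueError` rather than reaching the model.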