
	---
	license: cc
	language:
	- en
	- de
	base_model:
	- audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim
	- google-bert/bert-base-uncased
	tags:
	- emotion
	- audio_classification
	---
	This repo contains the official PyTorch checkpoint for the paper "ParaCLAP – Towards a general language-audio model for computational paralinguistic tasks".

	## Abstract
	Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to ‘answer’ a diverse set of language queries, extending the capabilities of audio models beyond a closed set of labels. However, CLAP relies on a large set of (audio, query) pairs for pretraining. While such sets are available for general audio tasks, like captioning or sound event detection, there are no datasets with matched audio and text queries for computational paralinguistic (CP) tasks. As a result, the community relies on generic CLAP models trained for general audio with limited success. In the present study, we explore training considerations for ParaCLAP, a CLAP-style model suited to CP, including a novel process for creating audio-language queries. We demonstrate its effectiveness on a set of computational paralinguistic tasks, where it is shown to surpass the performance of open-source state-of-the-art models.
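
To make the idea of contrastive pretraining on (audio, query) pairs concrete, here is a minimal sketch of a generic CLIP/CLAP-style symmetric contrastive objective in plain Python. It is an illustration only: the exact loss used for ParaCLAP is described in the paper, not in this README, and the function names and the `temperature` default are assumptions made for this example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Generic CLAP-style loss (illustrative, not the official ParaCLAP code).

    sim[i][j] is the similarity between audio i and text query j.
    Matched (audio, query) pairs are assumed to sit on the diagonal.
    """
    n = len(sim)
    scaled = [[v / temperature for v in row] for row in sim]
    # Audio-to-text direction: each row should pick its own column.
    a2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-audio direction: each column should pick its own row.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)
```

The loss is low when each audio clip is most similar to its own query and high when the pairs are confused, which is what pushes matched audio and text embeddings together during pretraining.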

	## Instructions
	Before running the evaluation, I recommend cloning the repo from Hugging Face or [GitHub](https://github.com/KeiKinn/ParaCLAP), since the script below imports `models_xin` and `utils` from it.
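
If you prefer to fetch the repo from Python rather than with `git clone`, something along these lines should work. This is a sketch under assumptions: it uses `snapshot_download` from the `huggingface_hub` package, takes the repo id `KeiKinn/paraclap` from the checkpoint URL in this README, and assumes the Hub repo ships the same `models_xin.py` and `utils.py` as the GitHub repo. The helper names `fetch_repo` and `add_to_path` are hypothetical.

```python
import sys
from pathlib import Path

REPO_ID = "KeiKinn/paraclap"  # assumed from the checkpoint URL in this README


def add_to_path(path: Path) -> None:
    """Put a directory at the front of sys.path so its modules can be imported."""
    resolved = str(path.resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)


def fetch_repo(local_dir: str = "paraclap") -> Path:
    """Download a snapshot of the Hub repo and make it importable (needs network)."""
    # Imported lazily so this helper can be defined without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))
    add_to_path(path)
    return path
```

After `fetch_repo()` returns, `from models_xin import CLAP` should resolve, provided the module is present in the downloaded snapshot.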
	### Evaluation
	```python
	import librosa
	import torch
	from transformers import AutoTokenizer, logging

	# models_xin and utils come from the cloned ParaCLAP repo
	from models_xin import CLAP
	from utils import compute_similarity


	if __name__ == '__main__':
	    logging.set_verbosity_error()

	    # Download (and cache) the ParaCLAP checkpoint from the Hub
	    ckpt = torch.hub.load_state_dict_from_url(
	        url="https://huggingface.co/KeiKinn/paraclap/resolve/main/best.pth.tar?download=true",
	        map_location="cpu",
	        check_hash=True,
	    )

	    text_model = 'bert-base-uncased'
	    audio_model = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'

	    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

	    candidates = ['happy', 'sad', 'surprise', 'angry']  # adapt to your needs
	    wavpath = '[Waveform path]'  # single-channel waveform

	    # Load the audio resampled to 16 kHz
	    waveform, sample_rate = librosa.load(wavpath, sr=16000)
	    x = torch.from_numpy(waveform)

	    tokenizer = AutoTokenizer.from_pretrained(text_model)

	    # Tokenize all candidate labels as one padded batch
	    candidate_tokens = tokenizer(
	        candidates,
	        padding=True,
	        truncation=True,
	        return_tensors='pt',
	    )

	    model = CLAP(
	        speech_name=audio_model,
	        text_name=text_model,
	        embedding_dim=768,
	    )

	    model.load_state_dict(ckpt)
	    model.to(device)
	    print('Checkpoint is loaded')
	    model.eval()

	    with torch.no_grad():
	        # Audio and text inputs must be on the same device as the model
	        z = model(
	            x.unsqueeze(0).to(device),
	            candidate_tokens.to(device),
	        )

	    similarity = compute_similarity(z[2], z[0], z[1])
	    prediction = similarity.T.argmax(dim=1)

	    # Map the index of the best-matching candidate back to its label
	    result = candidates[prediction.item()]
	    print(f'Predicted label: {result}')
	```
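
The script above returns only the top candidate. If you want to inspect how all candidates compare, a small helper like the following can turn the per-candidate scores into a ranked list. It is a sketch, not part of the ParaCLAP repo: `rank_candidates` is a hypothetical name, and it assumes you have one score per candidate in the same order as `candidates` (for example via `similarity.flatten().tolist()`, which in turn assumes `similarity` holds exactly one value per candidate for a single audio clip).

```python
import math


def rank_candidates(candidates, scores):
    """Return (label, probability) pairs sorted from most to least likely.

    Probabilities come from a softmax over the scores, so they only express
    relative preference among the given candidates.
    """
    if len(candidates) != len(scores):
        raise ValueError("need exactly one score per candidate")
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)
```

How sharp the resulting probabilities are depends on whether `compute_similarity` already applies a temperature or logit scale, so treat them as a ranking aid rather than calibrated confidences.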

	## Citation Info
	ParaCLAP has been accepted for presentation at Interspeech 2024.

	```bibtex
	@inproceedings{Jing24_PTA,
	title = {ParaCLAP – Towards a general language-audio model for computational paralinguistic tasks},
	author = {Xin Jing and Andreas Triantafyllopoulos and Björn Schuller},
	year = {2024},
	booktitle = {Interspeech 2024},
	pages = {1155--1159},
	doi = {10.21437/Interspeech.2024-1315},
	issn = {2958-1796},
	}
	```
	---
	license: cc-by-nc-nd-4.0
	language:
	- en
	---