	---
	license: apache-2.0
	base_model:
	- facebook/wav2vec2-base
	tags:
	- intent-classification
	- slu
	- audio-classification
	metrics:
	- accuracy
	- f1
	model-index:
	- name: wav2vec2-base-fsc-gold
	  results: []
	datasets:
	- fsc
	language:
	- en
	pipeline_tag: audio-classification
	library_name: transformers
	---

	# wav2vec2-base-FSC-GOLD (Retain Set)

	This model is a fine-tuned version of [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on the FSC dataset (retain set) for the intent classification task.

	It achieves the following results on the test set:
	- Accuracy: 0.992
	- F1: 0.993
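
As a rough illustration of how these two metrics relate to model predictions, the sketch below computes accuracy and a macro-averaged F1 in plain Python. The card does not say which F1 averaging was used, so macro averaging is an assumption here, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); guard against an empty denominator
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In practice a library such as scikit-learn or `evaluate` would normally be used; the hand-written version only makes the definitions explicit.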

	## Model description

	The base model is [Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/), pretrained on speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
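
One way to catch a sample-rate mismatch early is to read the rate from the file header before running inference. The following is a minimal sketch using only Python's standard `wave` module, which handles uncompressed PCM WAV files; the helper names and the `EXPECTED_SAMPLE_RATE` constant are hypothetical.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # rate the base model was pretrained on


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected=EXPECTED_SAMPLE_RATE):
    """True when the file's sample rate differs from the expected one."""
    return wav_sample_rate(path) != expected
```

If a file does need resampling, loading it with `librosa.load(path, sr=16000)` resamples it on the fly.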

	## Task and dataset description

	Intent Classification (IC) classifies utterances into predefined classes to determine the intent of speakers.
	The dataset used here is [Fluent Speech Commands (FSC)](https://arxiv.org/pdf/1904.03670), where each utterance is tagged with three intent labels: action, object, and location.
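
To make the three-slot structure concrete, the sketch below represents an intent as an (action, object, location) triple and flattens it into a single class label, which is one common way to treat FSC as a single-label classification problem. Whether this model's labels are encoded exactly this way is an assumption; the `Intent` type, the helper names, and the example slot values are illustrative only.

```python
from typing import NamedTuple


class Intent(NamedTuple):
    """One FSC-style intent: three slots describing the command."""
    action: str
    object: str
    location: str


def to_label(intent, sep="_"):
    """Flatten the three slots into a single class label string."""
    return sep.join(intent)


def build_label_index(intents):
    """Map each distinct flattened label to a stable integer class id."""
    labels = sorted({to_label(i) for i in intents})
    return {label: idx for idx, label in enumerate(labels)}


# Illustrative intents (slot values chosen for the example)
examples = [
    Intent("activate", "lights", "kitchen"),
    Intent("increase", "volume", "none"),
]
label2id = build_label_index(examples)
```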

	## Usage examples

	You can use the model directly in the following manner:
	```python
	import torch
	import librosa
	from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

	# Load an audio file; librosa resamples it to 16 kHz and returns a mono 1-D array
	audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

	# Load the fine-tuned model and the base model's feature extractor
	model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/wav2vec2-base-fsc-gold")
	feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

	# Extract features
	inputs = feature_extractor(audio_array, sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

	# Compute logits without tracking gradients (inference only)
	with torch.no_grad():
	    logits = model(**inputs).logits
	```
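
The logits are raw, unnormalized scores with one entry per class. A possible way to turn one row of logits into a predicted label and a probability is sketched below in plain Python; `softmax` and `top_prediction` are hypothetical helpers, and the `id2label` mapping is assumed to be a dict from class id to label name.

```python
import math


def softmax(logit_row):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logit_row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logit_row]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logit_row, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logit_row)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]
```

With the snippet above, `logits[0].tolist()` and `model.config.id2label` would supply the two arguments. The label names stored in the model config are not documented in this card, so they may be generic placeholders.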

	## Framework versions

	- Datasets 3.2.0
	- PyTorch 2.1.2
	- Tokenizers 0.20.3
	- Transformers 4.45.2
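
To compare a local environment against the versions listed above, something like the following sketch could be used. It relies on the standard `importlib.metadata` module; the `CARD_VERSIONS` mapping and `compare_versions` helper are hypothetical, and the keys are assumed to be the usual PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this card, keyed by assumed PyPI distribution name
CARD_VERSIONS = {
    "datasets": "3.2.0",
    "torch": "2.1.2",
    "tokenizers": "0.20.3",
    "transformers": "4.45.2",
}


def compare_versions(expected=CARD_VERSIONS):
    """Return {package: (card_version, installed_version_or_None)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (wanted, installed)
    return report
```

Other versions may well work; the comparison only shows where an environment differs from the one listed here.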

	## BibTeX entry and citation info

	```bibtex
	@inproceedings{koudounas2025unlearning,
	title={"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding},
	author={Koudounas, Alkis and Savelli, Claudio and Giobergia, Flavio and Baralis, Elena},
	booktitle={Proc. Interspeech 2025},
	year={2025},
	}
	```