---
license: apache-2.0
base_model:
- facebook/hubert-base-ls960
tags:
- intent-classification
- slu
- audio-classification
metrics:
- accuracy
- f1
model-index:
- name: hubert-base-unslurp-gold
  results: []
datasets:
- unslurp
language:
- en
pipeline_tag: audio-classification
library_name: transformers
---

# HuBERT-base-UNSLURP-GOLD (Retain Set)

This model is a fine-tuned version of [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960) on the UNSLURP dataset (retain set) for the intent classification task.

SLURP does not provide speaker-independent splits, which are required for machine unlearning techniques to be effective: the speaker identities in the retain, forget, and test sets must be mutually exclusive to successfully apply and evaluate unlearning methods. To address this, we propose new speaker-independent splits. In the following, we refer to the new dataset as SLURP*, or UNSLURP.
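
As an illustration of the constraint, here is a minimal sketch of a speaker-exclusive partition; the `speaker_ids` input, split fractions, and seed are illustrative assumptions, not the exact procedure used to build UNSLURP:

```python
import random

def split_speakers(speaker_ids, retain_frac=0.8, forget_frac=0.1, seed=42):
    """Partition speakers into mutually exclusive retain/forget/test pools."""
    speakers = sorted(set(speaker_ids))
    random.Random(seed).shuffle(speakers)
    n_retain = int(len(speakers) * retain_frac)
    n_forget = int(len(speakers) * forget_frac)
    retain = set(speakers[:n_retain])
    forget = set(speakers[n_retain:n_retain + n_forget])
    test = set(speakers[n_retain + n_forget:])
    # Every utterance is then routed to the split its speaker belongs to,
    # so no identity appears in more than one split.
    return retain, forget, test
```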
It achieves the following results on the test set:
- Accuracy: 0.826
- F1: 0.704
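
For reference, here is a minimal sketch of how such metrics can be computed with scikit-learn; the `average="macro"` setting is an assumption, since the card does not state how F1 is averaged:

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder label ids; in practice these come from running the model on the test set
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred, average="macro"))  # averaging mode is an assumption
```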
## Model description

This model builds on [Facebook's HuBERT](https://ai.facebook.com/blog/hubert-self-supervised-representation-learning-for-speech-recognition-generation-and-compression) base model, pretrained on 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz.
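
If your audio is stored at a different sampling rate, resample it first; a minimal sketch with librosa (the file path is a placeholder):

```python
import librosa

# Load at the file's native rate, then resample to the 16 kHz the model expects
audio, native_sr = librosa.load("path_to_audio.wav", sr=None)
audio_16k = librosa.resample(audio, orig_sr=native_sr, target_sr=16000)
```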
## Task and dataset description

Intent Classification (IC) assigns utterances to predefined classes in order to determine the speaker's intent.
The dataset used here is [(UN)SLURP](https://arxiv.org/abs/2011.13205), where each utterance is tagged with two intent labels: action and scenario.
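
In SLURP-style annotation, these two labels are often combined into a single intent class; a minimal sketch of that convention (the label strings are illustrative):

```python
# Illustrative scenario/action pair combined into one intent class
scenario, action = "alarm", "set"
intent = f"{scenario}_{action}"  # e.g. "alarm_set"
print(intent)
```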
## Usage examples

You can use the model directly in the following manner:

```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load an audio file, resampled to the 16 kHz rate the model expects
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/hubert-base-unslurp-gold")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")

# Extract features
inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

# Compute logits without tracking gradients, then map the top score to its label name
with torch.no_grad():
    logits = model(**inputs).logits
predicted_id = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])
```
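
Alternatively, the high-level `pipeline` API can wrap the same steps; a minimal sketch, assuming the checkpoint resolves a compatible feature extractor:

```python
from transformers import pipeline

classifier = pipeline("audio-classification", model="alkiskoudounas/hubert-base-unslurp-gold")
print(classifier("path_to_audio.wav"))
```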
## Framework versions

- Datasets 3.2.0
- Pytorch 2.1.2
- Tokenizers 0.20.3
- Transformers 4.45.2
## BibTeX entry and citation info

```bibtex
@inproceedings{koudounas2025unlearning,
  title={"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding},
  author={Koudounas, Alkis and Savelli, Claudio and Giobergia, Flavio and Baralis, Elena},
  booktitle={Proc. Interspeech 2025},
  year={2025}
}
```